In this project, we explore modelling Purpose-Built Student Accommodation ("PBSA") weekly rental rates in the UK for the 2024/25 academic year ("AY24/25") by considering a range of factors.
We have a dataset that details the weekly rate being charged for different sub-classifications of room types for different assets across the UK. Our aim is to build a model that can predict the rental rate to be charged for a specific room type in a specific asset based upon the features we have data for.
Other features include the city, postcode, operator, tenancy length, and rental information for previous academic years. The dataset also contains somewhat intermittent data on typical room sizes.
We will start by importing the data and refining it, given its sporadic nature, before performing some Exploratory Data Analysis ("EDA") to better understand it. We will then assess the data's suitability for a Linear Regression model, which shall be the first model we try, given its simple and interpretable nature. Next, we shall engineer our features so the data is ready for training, train the Linear Regression model, and assess its accuracy. We shall then consider two further modelling approaches, namely a Random Forest and a Gradient Boosting model, before refining a chosen model and evaluating the final, iterated version.
In this section, we shall import the data and begin cleaning it. We shall look to refine the dataset to ensure it does not include any defunct information, before correcting obvious errors and dealing with missing values. Our aim here is to prepare the dataset so it is ready for EDA. We shall also ensure there is consistency in our data, which shall include checking for duplicates, amongst other things.
We start by importing the necessary modules we shall need for this section.
import pandas as pd
import numpy as np
We now import the data into a Pandas DataFrame and do some initial analysis on the first five rows as well as examine the columns we have and the corresponding data types for each column.
rental_data = pd.read_csv("PBSA Rental Data.csv", encoding = "ISO-8859-1")
#We need the encoding as there are £ signs in the data
print(rental_data.head())
print(rental_data.info())
Dataset Issue ID Property Name Operator Address Line 1 \
0 Wave One 8540 The Combworks Aparto Student The Combworks
1 Wave One 8540 The Combworks Aparto Student The Combworks
2 Wave One 8540 The Combworks Aparto Student The Combworks
3 Wave One 8540 The Combworks Aparto Student The Combworks
4 Wave One 8540 The Combworks Aparto Student The Combworks
Address Line 2 City Postcode \
0 455 George Street Aberdeen AB25 3YB
1 455 George Street Aberdeen AB25 3YB
2 455 George Street Aberdeen AB25 3YB
3 455 George Street Aberdeen AB25 3YB
4 455 George Street Aberdeen AB25 3YB
Total Beds - Revised with any new intelligence \
0 134
1 134
2 134
3 134
4 134
Facilities ... \
0 All Utility Bills Included, Fully Fitted Kitch... ...
1 All Utility Bills Included, Fully Fitted Kitch... ...
2 All Utility Bills Included, Fully Fitted Kitch... ...
3 All Utility Bills Included, Fully Fitted Kitch... ...
4 All Utility Bills Included, Fully Fitted Kitch... ...
Property Name (Repeat) Operator (Repeat) 24-25 Capture Date 24-25 Notes \
0 The Combworks Aparto Student 20.11.23 NaN
1 The Combworks Aparto Student 20.11.23 NaN
2 The Combworks Aparto Student 20.11.23 NaN
3 The Combworks Aparto Student 20.11.23 NaN
4 The Combworks Aparto Student 20.11.23 NaN
23-24 Notes 22-23 Notes Number Of Rooms of Type \
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
Room Size - Revised with any new intelligence Unnamed: 37 Unnamed: 38
0 15.5 m2 NaN 12.0
1 15.5 m2 NaN 12.0
2 19.1 m2 NaN 12.0
3 19.1 m2 NaN 12.0
4 24.8 m2 NaN 12.0
[5 rows x 39 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12426 entries, 0 to 12425
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Dataset Issue 12408 non-null object
1 ID 12408 non-null object
2 Property Name 12404 non-null object
3 Operator 12405 non-null object
4 Address Line 1 12365 non-null object
5 Address Line 2 10882 non-null object
6 City 12407 non-null object
7 Postcode 12407 non-null object
8 Total Beds - Revised with any new intelligence 12386 non-null object
9 Facilities 12403 non-null object
10 Internet Speed 10444 non-null object
11 Build Date 10660 non-null object
12 Total Studios 2370 non-null object
13 Room Type 12413 non-null object
14 Sub Classification 12110 non-null object
15 22-23 Capture Date 11622 non-null object
16 22-23 Tenancy Period (weeks) 11654 non-null object
17 22-23 Price Per Week (£) 11633 non-null object
18 23-24 £ 10623 non-null object
19 Increase £ on 22-23 11109 non-null object
20 Increase % on 22-23 11113 non-null object
21 23-24 Period 11191 non-null object
22 23-24 Capture Date 11949 non-null object
23 24-25 £ 9749 non-null object
24 24-25 Period 11373 non-null object
25 £ Increase on 23-24 11741 non-null object
26 % Increase on 23-24 11741 non-null object
27 Room Type (Repeat) 12396 non-null object
28 Sub Classification (Repeat) 12110 non-null object
29 Property Name (Repeat) 12404 non-null object
30 Operator (Repeat) 12405 non-null object
31 24-25 Capture Date 12138 non-null object
32 24-25 Notes 5487 non-null object
33 23-24 Notes 5493 non-null object
34 22-23 Notes 2193 non-null object
35 Number Of Rooms of Type 2960 non-null object
36 Room Size - Revised with any new intelligence 7615 non-null object
37 Unnamed: 37 0 non-null float64
38 Unnamed: 38 12408 non-null float64
dtypes: float64(2), object(37)
memory usage: 3.7+ MB
None
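The encoding argument above matters: the '£' byte in this file is not valid UTF-8, so the default read fails. Below is a minimal, self-contained illustration of the same failure mode, using a hypothetical one-row CSV written to a temporary file (the filename and contents are invented for the demo).

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row CSV containing a "£" sign, written in Latin-1.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", encoding = "ISO-8859-1") as f:
    f.write("room,rent\nStudio,£204.00\n")

# The Latin-1 byte for "£" (0xA3) is not valid UTF-8, so the default read fails...
try:
    pd.read_csv(path)
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ...while an explicit encoding reads it cleanly.
demo = pd.read_csv(path, encoding = "ISO-8859-1")
print(utf8_ok, demo.loc[0, "rent"])  # False £204.00
```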
As we can see, the DataFrame rental_data comprises $12,426$ rows and $39$ columns. All of the columns, except for the last two, are of the object data type. We also note that each column has differing numbers of non-null values, suggesting that different rows may be missing different parts of data.
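To quantify which columns are patchiest, the per-column missing counts can be ranked. A sketch on a toy frame standing in for rental_data (the column values here are invented):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for rental_data: each column has a different
# number of NaNs, mirroring the differing non-null counts in .info().
toy = pd.DataFrame({
    "operator": ["Unite", None, "Fresh", "Unite"],
    "weekly_rent": [204.0, 194.0, np.nan, np.nan],
})

# Rank columns by missing count so the patchiest surface first.
missing = toy.isna().sum().sort_values(ascending = False)
print(missing.to_dict())  # {'weekly_rent': 2, 'operator': 1}
```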
We want to work with a smaller version of this DataFrame that only includes the columns we are interested in. We seek to remove all columns with defunct information, such as 'Dataset Issue' and 'ID', as well as patchy information like 'Room Size - Revised with any new intelligence', and information pertaining to other academic years. We also keep just one geographic variable to avoid multicollinearity; in this case, we keep 'City'.
Finally, we drop 'Sub Classification'. This is a useful piece of information that captures the differing sub-types of room, which results in a range of prices for one room type within any one asset. However, different operators use different classifications, such as 'Bronze', 'Silver', 'Gold' or 'Standard', 'Premium', 'Premium Plus', 'Luxe', and this differing terminology makes it impossible to truly aggregate the data on a like-for-like basis. For the purposes of this study, we have therefore elected to remove this level of granularity.
columns_to_keep = [2, 3, 6, 8, 11, 13, 23]
all_columns = list(range(rental_data.shape[1]))
columns_to_drop = [i for i in all_columns if i not in columns_to_keep]
rental_data = rental_data.drop(rental_data.columns[columns_to_drop], axis = 1)
column_names = rental_data.columns.tolist()
print(column_names)
['Property Name', 'Operator ', 'City', 'Total Beds - Revised with any new intelligence ', 'Build Date', 'Room Type ', '24-25 £']
Above we see the names of the columns we have remaining. We now seek to standardise these names.
column_names_dict = {
"Property Name" : "asset",
"Operator " : "operator",
"City" : "city",
"Build Date" : "build_date",
"Total Beds - Revised with any new intelligence " : "beds",
"Room Type " : "room_type",
"24-25 £" : "weekly_rent",
}
rental_data = rental_data.rename(columns = column_names_dict)
print(rental_data.head())
           asset        operator      city  beds  build_date room_type  \
0  The Combworks  Aparto Student  Aberdeen   134        2018    Studio
1  The Combworks  Aparto Student  Aberdeen   134        2018    Studio
2  The Combworks  Aparto Student  Aberdeen   134        2018    Studio
3  The Combworks  Aparto Student  Aberdeen   134        2018    Studio
4  The Combworks  Aparto Student  Aberdeen   134        2018    Studio

  weekly_rent
0     £204.00
1     £194.00
2     £229.00
3     £219.00
4     £245.00
We now look to start cleaning the data we have.
We start with room types, where we want to standardise the classifications.
unique_room_types = rental_data.room_type.unique()
for room_type in unique_room_types:
print(room_type)
Studio En-Suite Non En-Suite En-suite Non En-suite One Bed Apartment Dual Studio One Bedroom Apartment Twin nan Duplex Apartment Duplex Apartment Dual-Studio Ensuite Premium Plus Standard Plus Non En-suite 331 0
As we can see, there are different variations of the same room types, as well as nonsensical values such as $0$ and $331$.
Below we create a list of what we want to map each of these unique values to, before creating a dictionary of the two lists and using the map() function to amend the data in rental_data.
We have chosen the following classifications: 'Studio', 'En-Suite', 'Non En-Suite', 'One Bed' and 'Twin'. Some clearly missing values are classified as NaN.
updated_room_types = [
"Studio", "En-Suite", "Non En-Suite",
"En-Suite", "Non En-Suite", "One Bed",
"En-Suite", "One Bed", "Twin", np.nan,
np.nan, "One Bed", "One Bed", "En-Suite",
"En-Suite", np.nan, np.nan, "Non En-Suite",
np.nan, np.nan
]
room_type_dict = dict(zip(unique_room_types, updated_room_types))
rental_data["room_type"] = rental_data["room_type"].map(room_type_dict)
We note that we cannot use rows where this data is NaN. Furthermore, the weekly_rent column for the 'Twin' rooms can be misleading as it sometimes states a value to be paid on a per person basis and other times on an entire room basis. For these reasons, we remove the rows where the room type is one of NaN or 'Twin'.
remove_room_types = [np.nan, "Twin"]
rental_data.drop(rental_data[rental_data["room_type"].isin(remove_room_types)].index, inplace = True)
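It is worth noting that Series.map() returns NaN for any value absent from the mapping dictionary, which is the mechanism that lets unmapped labels fall out automatically. A toy illustration (the values here are invented):

```python
import pandas as pd

# "0" has no key in the mapping, so .map() yields NaN for it --
# the same mechanism that flags nonsense room types for removal.
s = pd.Series(["En-suite", "Ensuite", "0"])
mapping = {"En-suite": "En-Suite", "Ensuite": "En-Suite"}
mapped = s.map(mapping)
print(mapped.tolist())  # ['En-Suite', 'En-Suite', nan]
```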
We now consider the operators and repeat a similar exercise.
unique_operators = rental_data.operator.unique()
for operator in unique_operators:
print(operator)
Aparto Student Every Student Granite City Developments LLP Hello Student HFS - Homes For Students HFS - Prestige Student Living Mezzino Now Students Student Roost u-student UNITE Students Bellevue Student Limited HFS - Universal Student Living iQ Student Accommodation Project Student Fresh Student Living Host Students Student Castle HFS - Homes For Students LIVStudent Novel Student Quest Student Management Vita Student Allied Student Accommodation Almero Student Mansions Campus Living Villages Canvas Student Collegiate AC Evenbrook Group Here Students HFS - Home For Students Luna Students - Name Change From Torsion Students Pennycuick Collins Premier Student Halls Prime Student Living Purple Frog Property Ltd Student Letting Company True Student UniHouse Volume Property Yugo Primo Property Management Propeller Lettings - Was Cloud Student Homes ASN Capital CRM Students Lulworth Student Company Cloud Student Homes Kexgill Student Accommodation Sanctuary Student Abodus Student Living iQ Student Accommodation Dwell Student Living HFS - Prestige Student Living Student Castle Property Management Service Study Inn Abodus Student Living - was Nido Derwent Students Downing Students HFS - Urban Student Life CPS Homes Key Let Unest Student Facility Management Xenia Students Beyond The Box Student Ltd Future Generation Asset Management Limited Axo Student Living Code Student Accommodation CRM Students HFS - Universal Student Living - Was Luna Students - Name Change From Torsion Students Was - Axo Student Living CRM Mansion Student Mears Group West One Property Management & Factoring Ltd Student Cribs Unite Students East Of Exe City Estates The Social Hub Future Generation Asset Management Limited HFS - Essential Student Living Scape Student Living UniLife DIGS Student Vanilla Lettings Ashcourt Bailrigg Student Living City Block Yellow Door Lets Here Students HFS - Essential Living IconInc N Joy Student Living Park Lane Properties Samara Homes Spencer Properties Unipol YPP Bee 
Hive - Harington Investments Sodexo Student Living nan TBC - Was Hello Student APS Property Group Ltd Gather Students Alexander Student Property Group Aspenhawk Limited Caro Student Living Condor Properties Mapleisle Ltd McComb Property Company Ltd Orange Liverpool Ltd T/A Loc8me Stockton Students Urban Sleep Was Hello Student X1 Lettings Chapter London Find Digs Gather Students Ltd Project Student Smart Student Accommodation Student Management Services Urbanest Future Generation Asset Management Limited Stanton Asset Management Apex Student Living Ashley Educational Trust Hartley Hall of Residence Host Students Graysons Properties North Eastern YWCA Carvels Lettings Heathfield Norwich Limited HFS - Evo Student Luna Students - Was Torsion Students Manor Villages Megaclose Ltd Oak Student Lets Oak Student Letts Student Living UNIPOL Aspire Student Lettings Metro Student Accommodation Stay Clever Days Letting Fenton Property Holdings Nurtur Student Living Portergate Property Management T J Thomas Warehouse Students Ltd Bagri Foundation HFS - Homes for Students Loddon House Student Living Campbell Property Cloud-Student-Homes SPACE Student Accomodation Dog & Bone Properties Ltd Empire House Student Halls Stoke Student Living Dawsons Digs Swansea Lettings StudentDigz Living Worcester Group
updated_operators = [
'Aparto Student',
'Every Student',
'Granite City Developments LLP',
'Hello Student',
'HFS - Homes For Students',
'HFS - Prestige Student Living',
'Mezzino',
'Now Students',
'Student Roost',
'u-student',
'UNITE Students',
'Bellevue Student Limited',
'HFS - Universal Student Living',
'iQ Student Accommodation',
'Project Student',
'Fresh Student Living',
'Host Students',
'Student Castle',
'HFS - Homes For Students',
'LIVStudent',
'Novel Student',
'Quest Student Management',
'Vita Student',
'Allied Student Accommodation',
'Almero Student Mansions',
'Campus Living Villages',
'Canvas Student',
'Collegiate AC',
'Evenbrook Group',
'Here Students',
'HFS - Homes For Students',
'Luna Students',
'Pennycuick Collins',
'Premier Student Halls',
'Prime Student Living',
'Purple Frog Property Ltd',
'Student Letting Company',
'True Student',
'UniHouse',
'Volume Property',
'Yugo',
'Primo Property Management',
'Propeller Lettings',
'ASN Capital',
'CRM Students',
'Lulworth Student Company',
'Cloud Student Homes',
'Kexgill Student Accommodation',
'Sanctuary Student',
'Abodus Student Living',
'iQ Student Accommodation',
'Dwell Student Living',
'HFS - Prestige Student Living',
'Student Castle',
'Study Inn',
'Abodus Student Living',
'Derwent Students',
'Downing Students',
'HFS - Urban Student Life',
'CPS Homes',
'Key Let',
'Unest',
'Student Facility Management',
'Xenia Students',
'Beyond The Box Student Ltd',
'Future Generation Asset Management Limited',
'Axo Student Living',
'Code Student Accommodation',
'CRM Students',
'HFS - Universal Student Living',
np.nan,
'CRM Students',
'Mansion Student',
'Mears Group',
'West One Property Management & Factoring Ltd',
'Student Cribs',
'UNITE Students',
'East Of Exe',
'City Estates',
'The Social Hub',
'Future Generation Asset Management Limited',
'HFS - Essential Student Living',
'Scape Student Living',
'UniLife',
'DIGS Student',
'Vanilla Lettings',
'Ashcourt',
'Bailrigg Student Living',
'City Block',
'Yellow Door Lets',
'Here Students',
'HFS - Essential Student Living',
'IconInc',
'N Joy Student Living',
'Park Lane Properties',
'Samara Homes',
'Spencer Properties',
'Unipol',
'YPP',
'Bee Hive - Harington Investments',
'Sodexo Student Living',
np.nan,
np.nan,
'APS Property Group Ltd',
'Gather Students',
'Alexander Student Property Group',
'Aspenhawk Limited',
'Caro Student Living',
'Condor Properties',
'Mapleisle Ltd',
'McComb Property Company Ltd',
'Orange Liverpool Ltd T/A Loc8me',
'Stockton Students',
'Urban Sleep',
np.nan,
'X1 Lettings ',
'Chapter London',
'Find Digs ',
'Gather Students',
'Project Student',
'Smart Student Accommodation',
'Student Management Services',
'Urbanest',
'Future Generation Asset Management Limited',
'Stanton Asset Management',
'Apex Student Living',
'Ashley Educational Trust',
'Hartley Hall of Residence',
'Host Students',
'Graysons Properties',
'North Eastern YWCA',
'Carvels Lettings',
'Heathfield Norwich Limited',
'HFS - Evo Student',
'Luna Students',
'Manor Villages',
'Megaclose Ltd',
'Oak Student Lets',
'Oak Student Lets',
'Student Living',
'Unipol',
'Aspire Student Lettings',
'Metro Student Accommodation',
'Stay Clever',
'Days Letting',
'Fenton Property Holdings',
'Nurtur Student Living',
'Portergate Property Management',
'T J Thomas',
'Warehouse Students Ltd',
'Bagri Foundation',
'HFS - Homes For Students',
'Loddon House Student Living',
'Campbell Property',
'Cloud Student Homes',
'SPACE Student Accomodation',
'Dog & Bone Properties Ltd ',
'Empire House Student Halls',
'Stoke Student Living',
'Dawsons',
'Digs Swansea Lettings',
'StudentDigz',
'Living Worcester Group'
]
operator_dict = dict(zip(unique_operators, updated_operators))
rental_data["operator"] = rental_data["operator"].map(operator_dict)
rental_data.dropna(subset = ["operator"], inplace = True)
rental_data["operator"] = rental_data["operator"].astype("string")
We repeat the process for city and note that there are no duplicates and that the names are already standardised.
unique_cities = rental_data.city.unique()
for city in unique_cities:
print(city)
Aberdeen Aberystwyth Bangor Bath Bedford Belfast Birmingham Bolton Bournemouth Bradford Brighton Bristol Cambridge Canterbury Cardiff Carlisle Cheltenham Chester Colchester Coventry Derby Dundee Durham Edinburgh Exeter Falmouth Glasgow Guildford Huddersfield Hull Ipswich Kingston upon Thames Lancaster Leeds Leicester Lincoln Liverpool London Loughborough Luton Manchester Medway Newcastle Newport (Wales) Norwich Nottingham Oxford Paisley Plymouth Portsmouth Preston Reading Salford Sheffield Southampton St Andrews Stirling Stockton & Middlesbrough Stoke On Trent Sunderland Swansea Warwick Winchester Wolverhampton Worcester Wrexham York
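If we wanted to verify this standardisation programmatically rather than by eye, a stripped, casefolded key exposes near-duplicate labels that .unique() treats as distinct. A sketch on hypothetical city names (not taken from the dataset):

```python
import pandas as pd

# Hypothetical city labels: a stripped, casefolded key exposes
# near-duplicates that .unique() alone would treat as distinct.
cities = pd.Series(["Leeds", "leeds ", "York"])
key = cities.str.strip().str.casefold()
near_dupes = cities[key.duplicated(keep = False)]
print(near_dupes.tolist())  # ['Leeds', 'leeds ']
```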
We repeat the exercise for beds and check that the minimum is above zero. We then drop the rows with non-numerical or missing values.
unique_beds = rental_data.beds.unique()
for number in unique_beds:
print(number)
134 171 147 178 131 150 273 45 56 123 196 173 130 26 36 72 512 77 199 360 618 222 399 511 254 38 30 96 200 97 18 383 60 335 20 169 78 31 40 144 330 94 180 104 331 517 156 717 462 253 407 474 413 430 217 401 83 240 604 909 463 1048 600 106 290 337 609 103 140 209 1025 62 267 120 434 460 432 435 154 146 48 132 50 73 398 70 259 417 420 596 647 677 656 586 47 534 520 28 127 221 84 65 102 206 100 486 308 470 519 550 403 504 1026 755 108 195 51 183 129 197 71 351 555 378 301 166 348 479 157 499 93 153 75 99 86 87 88 225 361 98 137 210 300 246 431 270 355 251 212 219 386 540 79 607 117 168 491 374 686 112 90 136 544 675 477 67 314 644 384 350 410 632 380 288 334 643 281 39 133 323 128 121 402 53 282 229 779 252 1040 778 385 1206 266 307 439 167 155 449 737 391 453 262 344 95 286 353 823 115 436 66 505 614 961 464 780 496 126 326 152 116 498 191 409 69 194 276 109 233 362 473 190 358 110 148 238 151 356 257 170 135 247 74 138 325 249 396 43 59 272 581 260 278 237 250 205 421 64 82 203 159 61 599 226 268 218 492 124 29 113 312 111 145 416 174 310 701 588 536 91 89 400 458 422 264 440 349 315 465 181 405 232 501 122 346 553 1381 179 471 653 256 427 81 935 590 214 631 239 443 551 514 85 476 76 54 411 211 752 634 978 302 298 52 320 22 161 445 25 613 564 563 533 964 497 976 376 19 255 943 601 500 nan 664 274 669 107 17 80 284 143 182 369 68 690 101 220 425 1329 569 24 289 37 158 516 261 404 735 592 412 390 475 535 317 192 294 16 34 354 231 271 999 928 1236 1085 248 635 776 280 160 63 149 TBC - Re CLV 824 283 424 574 950 611 482 1117 306 450 802 333 263 328 184 699 393 32 55 295 92 198 185 347 805 187 673 365 704 841 244 35 770 541 578 176 527 188 687 617 245 694 920 674 528 698 142 186 329 452 572 1001 852 654 657 305 177 454 1100 758 230 42 105 316 792 1017 561 118 433 164 27 46 603 529 438 321 712 671 729 279 1106 1469 207 444 543 313 236 275 277 371 418 345 526 575 695 213 332 442 742 44 739 437 483 693 162 292 215 125 522 114 297 547 472 1096 800 598 484 808 515 887 193 235 
576 309 836 480 726 1000 849 175 234 395 141 1335 242 228 457 243 860 666 455 324 319 224 691 767 397 389 972 139 992 41 446 163 366 423 204 507 562 467 241 645 557 370 967 706 296 650 357 119
rental_data.drop(rental_data[rental_data["beds"] == "TBC - Re CLV "].index, inplace = True)
rental_data.dropna(subset = ["beds"], inplace = True)
rental_data["beds"] = rental_data["beds"].astype(int)
print('The smallest asset has ' + str(rental_data.beds.min()) + ' beds.')
The smallest asset has 16 beds.
We now turn our attention to build_date.
unique_build_dates = rental_data.build_date.unique()
for year in unique_build_dates:
print(year)
2018 1992 2003 2001 nan 2014 2016 2017 2023 1997 1999 2006 2015 2007 2024 2012 2022 2009 2004 2021 2019 2020 2011 + 2021 (Understood to be 16 annex rooms) 2013 2011 1993 2008 199 2005 2010 2000 2002 1996 1995 2005 & 2007 2006 & 2007 2008 & 2007 2007 & 2007 2009 & 2007 2011 & 2007 2010 & 2007 2012 & 2007 201 1995 - Refurb 2022 1994 2094 1965 1950 2022 - Refurb 1998 - Refurb 2017 1998 2001 & 2022 Refurbished for 2023 2002 & 2005 2002 & 2005 2016 & 2020
As we can see, some of the data contains two years in the event of there being a refurbishment. There are also years stated as $201$, $2094$, and $199$, which appear to be incorrect or missing a digit. Below we investigate which assets these years relate to and see if we can manually find the correct year using other sources.
rental_data[rental_data["build_date"].isin(['199', '201', '2094'])]
|   | asset | operator | city | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|
| 946 | Oak Brook Park | UNITE Students | Birmingham | 656 | 199 | En-Suite | NaN |
| 947 | Oak Brook Park | UNITE Students | Birmingham | 656 | 199 | En-Suite | NaN |
| 4474 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | NaN |
| 4475 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | NaN |
| 4476 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | £127.00 |
| 4477 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | £122.00 |
| 4478 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | NaN |
| 4479 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | NaN |
| 4480 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | £134.00 |
| 4481 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | £129.00 |
| 4482 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | NaN |
| 4483 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | NaN |
| 4484 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | £143.00 |
| 4485 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | £138.00 |
| 4486 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | £146.00 |
| 4487 | Snow Island | Student Castle | Huddersfield | 427 | 201 | En-Suite | £141.00 |
| 5319 | 134 New Walk | Hello Student | Leicester | 20 | 2094 | Studio | £181.00 |
| 5320 | 134 New Walk | Hello Student | Leicester | 20 | 2094 | Studio | £186.00 |
| 5321 | 134 New Walk | Hello Student | Leicester | 20 | 2094 | Studio | NaN |
| 5322 | 134 New Walk | Hello Student | Leicester | 20 | 2094 | Studio | NaN |
| 5323 | 134 New Walk | Hello Student | Leicester | 20 | 2094 | Studio | £179.00 |
| 10678 | X1 The Campus | X1 Lettings | Salford | 272 | 201 | Studio | £200.00 |
| 10679 | X1 The Campus | X1 Lettings | Salford | 272 | 201 | Studio | £249.00 |
| 10680 | X1 The Campus | X1 Lettings | Salford | 272 | 201 | Studio | £190.00 |
We note that Oak Brook Park has no weekly_rent data and is therefore likely to be dropped before modelling. That leaves Snow Island, 134 New Walk, and X1 The Campus.
External sources reveal that X1 The Campus was completed in 2018, Snow Island in 2012, and 134 New Walk was refurbished in 2016. We will amend these in the dataset.
Below we replace the years with accurate ones. We have mapped $201$ to $2018$ to fix the X1 The Campus figure, and will correct Snow Island's build date separately afterwards; we have also mapped $2094$ to $2016$ for 134 New Walk. Where a refurbishment gives two dates, we select the newer one, as the build date is intended to proxy for the specification of the property. Years that could not be deciphered are labelled np.nan, including Oak Brook Park's, since that data point will be removed later anyway for having no weekly_rent data.
new_build_dates = [
'2018',
'1992',
'2003',
'2001',
np.nan,
'2014',
'2016',
'2017',
'2023',
'1997',
'1999',
'2006',
'2015',
'2007',
'2024',
'2012',
'2022',
'2009',
'2004',
'2021',
'2019',
'2020',
'2011',
'2013',
'2011',
'1993',
'2008',
'2023',
'2005',
'2010',
'2000',
'2002',
'1996',
'1995',
'2007',
'2007',
'2007',
'2007',
'2009',
'2011',
'2010',
'2012',
'2018',
'2022',
'1994',
'2016',
'1965',
'1950',
'2022',
'2017',
'1998',
'2022',
'2023',
'2005',
'2005',
'2020']
build_date_dict = dict(zip(unique_build_dates, new_build_dates))
rental_data["build_date"] = rental_data["build_date"].map(build_date_dict)
rental_data.loc[rental_data['asset'] == 'Snow Island', 'build_date'] = 2012
rental_data.dropna(subset = ["build_date"], inplace = True)
rental_data["build_date"] = rental_data["build_date"].astype(int)
We now consider the dependent variable, weekly_rent. Below we replace whitespace-only entries with NaN using a regex, before removing the '£' signs and thousands-separator commas. This allows us to convert the column to a float data type.
rental_data.weekly_rent = rental_data.weekly_rent.replace(r"^\s*$", np.nan, regex = True)
rental_data.weekly_rent = rental_data.weekly_rent.str.replace("£", "")
rental_data.weekly_rent = rental_data.weekly_rent.str.replace(",", "")
rental_data.weekly_rent = rental_data.weekly_rent.astype(float)
rental_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10574 entries, 0 to 12406
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   asset        10574 non-null  object
 1   operator     10574 non-null  string
 2   city         10574 non-null  object
 3   beds         10574 non-null  int32
 4   build_date   10574 non-null  int32
 5   room_type    10574 non-null  object
 6   weekly_rent  8141 non-null   float64
dtypes: float64(1), int32(2), object(3), string(1)
memory usage: 578.3+ KB
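The same cleaning can be expressed more defensively with pd.to_numeric and errors="coerce", which turns anything non-numeric (including whitespace-only entries) into NaN in one step. A sketch on toy rent strings (invented for the demo):

```python
import pandas as pd

# Toy rent strings: strip "£" and thousands commas, then coerce
# anything non-numeric (including whitespace-only entries) to NaN.
rents = pd.Series(["£204.00", "£1,100.00", "   ", None])
clean = pd.to_numeric(rents.str.replace(r"[£,]", "", regex = True),
                      errors = "coerce")
print(clean.isna().tolist())  # [False, False, True, True]
```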
We now have all of the variables in the correct data types and have removed the NaN values from the other columns. We still need to deal with the missing values in the weekly_rent column.
We note that much of the missing data relates to room types or tenancy lengths discontinued from previous years, and is therefore less damaging than it first appears. Since removing the empty rows still leaves us with over $8,000$ data points, we drop the rows with no weekly_rent rather than attempting to interpolate them, ensuring we work with more accurate, if slightly less complete, data.
rental_data.dropna(subset = ["weekly_rent"], inplace = True)
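As part of the consistency checks mentioned at the outset, we can also count exact duplicate rows with DataFrame.duplicated(). Here we would only report the count rather than drop, since repeated rent rows can be genuine (several identical rooms at the same price). A toy example with invented rows:

```python
import pandas as pd

# Hypothetical rows: .duplicated() flags exact repeats. We report the
# count rather than dropping, since repeated rent rows can be genuine.
rows = pd.DataFrame({
    "asset": ["A", "A", "B"],
    "room_type": ["Studio", "Studio", "En-Suite"],
    "weekly_rent": [204.0, 204.0, 180.0],
})
n_dupes = rows.duplicated().sum()
print(n_dupes)  # 1
```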
We note that, following the removal of the 'Sub Classification' column, we have a range of weekly rents for each room type in each asset, without the sub-class to distinguish them. To simplify the data for modelling, we need to aggregate these values; common aggregation techniques include the mean and the median.
In PBSA assets, it is not uncommon to find one or two premium rooms at the top of the building that benefit from floor uplift, extra size, balconies, or access to exterior space. Some assets will therefore have outliers within the range of weekly rents for a given room type, so we calculate the median for each room type, as it is less affected by extreme values.
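The robustness argument is easy to demonstrate: a single hypothetical premium room shifts the mean of a toy rent list far more than the median.

```python
import pandas as pd

# Four ordinary rooms plus one hypothetical penthouse studio: the
# outlier drags the mean to £246 while the median stays at £190.
rents = pd.Series([180.0, 185.0, 190.0, 195.0, 480.0])
print(rents.mean(), rents.median())  # 246.0 190.0
```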
We also recast the data types for asset, room_type, and city as strings.
#It is unlikely that two assets in one city will share a name, so aggregating by these features is sufficient
rental_data = rental_data.groupby(["asset", "room_type", "city"]).agg(
operator = ("operator", "first"),
beds = ("beds", "first"),
build_date = ("build_date", "first"),
weekly_rent = ("weekly_rent", "median")
).reset_index()
rental_data[["asset", "room_type", "city"]] = rental_data[["asset", "room_type", "city"]].astype("string")
We now create a unique asset_id for each asset to give us an unambiguous key to aggregate the data by. We then reorder the columns to bring asset_id to the front.
rental_data["asset_id"] = rental_data["asset"] + rental_data["operator"] + rental_data["city"]
unique_asset_name_list = rental_data.asset_id.unique()
numbered_id_list = list(range(len(unique_asset_name_list)))
asset_id_dict = dict(zip(unique_asset_name_list, numbered_id_list))
rental_data["asset_id"] = rental_data["asset_id"].map(asset_id_dict)
cols_to_order = ["asset_id", "room_type", "weekly_rent"]
ordered_cols = ["asset_id"] + [col for col in rental_data.columns if col not in cols_to_order] + ["room_type"] + ["weekly_rent"]
rental_data = rental_data[ordered_cols]
print(rental_data.head(10))
print(rental_data.info())
asset_id asset city \
0 0 134 New Walk Leicester
1 1 136 New Walk Leicester
2 2 191 Kings Road Student Living Reading
3 2 191 Kings Road Student Living Reading
4 3 207 King Street Aberdeen
5 3 207 King Street Aberdeen
6 4 249 Windsor Court Birmingham
7 5 25 Cross Street Manchester
8 6 26 Great George Street Leeds
9 7 27 Kings Stables Road Edinburgh
operator beds build_date room_type weekly_rent
0 Hello Student 20 2016 Studio 181.00
1 Hello Student 30 2013 Studio 183.50
2 Bagri Foundation 72 2019 En-Suite 192.50
3 Bagri Foundation 72 2019 Studio 217.50
4 Mezzino 26 1997 Non En-Suite 103.50
5 Mezzino 26 1997 One Bed 199.00
6 Student Letting Company 73 2022 Studio 215.00
7 YPP 27 2016 Studio 248.77
8 HFS - Prestige Student Living 54 2024 Studio 299.00
9 Hello Student 166 2019 Studio 386.50
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1725 entries, 0 to 1724
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 asset_id 1725 non-null int64
1 asset 1725 non-null string
2 city 1725 non-null string
3 operator 1725 non-null string
4 beds 1725 non-null int32
5 build_date 1725 non-null int32
6 room_type 1725 non-null string
7 weekly_rent 1725 non-null float64
dtypes: float64(1), int32(2), int64(1), string(4)
memory usage: 94.5 KB
None
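Incidentally, the zip/dict ID assignment used above has a one-line equivalent in pd.factorize, which assigns integer codes in order of first appearance. A sketch on toy concatenated keys (invented for the demo):

```python
import pandas as pd

# Toy concatenated keys: pd.factorize assigns integer codes in order
# of first appearance, matching the zip/dict approach used above.
keys = pd.Series(["AssetAOpX", "AssetAOpX", "AssetBOpY", "AssetCOpX"])
codes, uniques = pd.factorize(keys)
print(codes.tolist())  # [0, 0, 1, 2]
```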
We now have cleaned data in which each asset has a standard set of room types, each with a single weekly rent value reflecting the median of the rents charged across the sub-types of that room type. We now proceed to EDA.
In this section we shall consider key summary statistics for each variable as well as visualising the distributions of the data, the relationships between various factors, and assessing and dealing with any outliers that arise.
Below we import further modules we shall need for this section.
import matplotlib.pyplot as plt
import seaborn as sns
We start by considering the summary statistics for each variable. This gives us a sense of the size of our data set and allows us to more intuitively understand the data.
print(rental_data.describe(include = 'all'))
asset_id asset city operator beds \
count 1725.000000 1725 1725 1725 1725.000000
unique NaN 921 64 107 NaN
top NaN Crown House London UNITE Students NaN
freq NaN 8 175 144 NaN
mean 472.235942 NaN NaN NaN 314.461449
std 272.593296 NaN NaN NaN 243.355083
min 0.000000 NaN NaN NaN 17.000000
25% 233.000000 NaN NaN NaN 127.000000
50% 470.000000 NaN NaN NaN 250.000000
75% 710.000000 NaN NaN NaN 436.000000
max 938.000000 NaN NaN NaN 1469.000000
build_date room_type weekly_rent
count 1725.000000 1725 1725.000000
unique NaN 4 NaN
top NaN Studio NaN
freq NaN 803 NaN
mean 2014.648116 NaN 234.132290
std 6.284967 NaN 95.648662
min 1992.000000 NaN 75.000000
25% 2012.000000 NaN 168.500000
50% 2016.000000 NaN 213.000000
75% 2019.000000 NaN 275.000000
max 2024.000000 NaN 868.000000
From these summary statistics, we immediately notice that there are $1,725$ entries in our data set, reflecting $938$ different assets (with only $921$ unique asset names, as some assets share a name), $64$ unique cities and $107$ operators. We note that London contains the most assets in our data set, which is not surprising given it contains by far the most Higher Education Institutions ("HEIs") in the UK and the largest student population. Furthermore, we note that UNITE Students operates the most assets in our data set.
Regarding beds, we note that the data appears to be positively skewed and there is large variance in the bed numbers as seen by the standard deviation and the range. build_date seems to be slightly negatively skewed, which is unsurprising given the boom in the PBSA market as an alternative investment class leading to an increase in development to keep up with rising student numbers. Finally, we note that weekly_rent appears to be positively skewed and that there is also a large range with a maximum weekly_rent of $868$ pounds per week.
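These visual impressions of skew can be quantified with the (adjusted Fisher-Pearson) skewness coefficient; a minimal sketch on toy values spanning the same ranges as our data (illustrative, not the actual dataset):

```python
import pandas as pd

# Toy numeric columns spanning roughly the observed ranges of beds and weekly_rent
df = pd.DataFrame({
    "beds": [17, 127, 250, 436, 1469],
    "weekly_rent": [75.0, 168.5, 213.0, 275.0, 868.0],
})

# Positive skewness values confirm a right (positive) skew; negative, a left skew
print(df.skew())
```

A rule of thumb is that values beyond roughly $\pm 1$ indicate substantial skew.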
We now consider the distribution of weekly_rent by examining a histogram and a box plot of the data.
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.histplot(rental_data["weekly_rent"], kde = True, color = (0, 0.13, 0.27))
ax.set_title("Distribution of the Weekly Rent", fontweight = "semibold")
ax.set_xlabel("Weekly Rent", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.boxplot(x = rental_data["weekly_rent"],
            boxprops = dict(edgecolor = (0, 0, 0), facecolor = (0, 0.13, 0.27), alpha = 0.4),
            whiskerprops = dict(color = (0, 0, 0)),
            medianprops = dict(color = (0, 0, 0)),
            )
ax.set_title("Box Plot of the Weekly Rent", fontweight = "semibold")
ax.set_xlabel("Weekly Rent", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
As first thought by looking at the summary statistics, the weekly_rent is indeed positively skewed. It is likely we will need to transform this data using a logarithmic transformation in order to prepare it for modelling. We will deal with this at a later stage.
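The effect of such a transformation can be sketched quickly on toy rents (illustrative values; `log1p` is used rather than `log` as it is safe near zero):

```python
import numpy as np
import pandas as pd

# Toy rents with a heavy right tail, mimicking the shape of weekly_rent
rents = pd.Series([75.0, 168.5, 213.0, 275.0, 868.0])

# The log transform compresses the right tail, pulling the distribution
# closer to symmetry
log_rents = np.log1p(rents)

# Skewness should fall markedly after the transform
print(rents.skew(), log_rents.skew())
```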
Furthermore, the box plot suggests there are a large number of outliers. We will need to explore this further before deciding how to deal with it. From a first look, however, many of the 'outliers' grouped beyond the upper whisker are likely due to different cities having wildly different weekly rents (largely on account of differing sub-market real estate dynamics), rendering entire markets as 'outliers' given the positive skew of the data. An example is London, where rents are far higher than in the rest of the UK on account of increased demand and the competing land uses in the capital rendering developments viable only at higher rents. However, this may not be the case for all of the 'outliers' on the box plot above. Notably, the rent at almost $900$ pounds per week does indeed appear to be a genuine outlier.
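The whiskers in these box plots follow Tukey's rule: a point is flagged when it lies more than $1.5 \times \text{IQR}$ beyond the quartiles. A minimal sketch of the rule on toy rents:

```python
import pandas as pd

# Toy rents: a tight cluster plus one extreme value
rents = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0, 200.0, 900.0])

# Tukey's rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = rents.quantile(0.25), rents.quantile(0.75)
iqr = q3 - q1
outliers = rents[(rents < q1 - 1.5 * iqr) | (rents > q3 + 1.5 * iqr)]
print(outliers)
```

This is the same criterion we implicitly apply when reading the box plots below.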
Given this, we consider the distribution and box plot of the weekly_rent on a city-by-city basis.
cities = sorted(rental_data.city.unique())
ncol = 2
row_dim = int(np.ceil(len(cities)/ncol))
plt.clf()
fig, axes = plt.subplots(nrows = row_dim, ncols = ncol, figsize = (10, 6*row_dim))
axes = axes.flatten()
fig.suptitle("Weekly Rent Distribution by City", fontweight = "bold")
for i, city in enumerate(cities):
    ax = axes[i]
    sns.histplot(rental_data[rental_data["city"] == city]["weekly_rent"], bins = 12, ax = ax, color = (0, 0.13, 0.27))
    ax.set_title(city, fontweight = "semibold")
    ax.set_xlabel("Weekly Rent", fontweight = "medium")
    ax.set_ylabel("Count", fontweight = "medium")
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
for i in range(len(cities), len(axes)):
    axes[i].axis("off")
plt.tight_layout()
plt.subplots_adjust(top = 0.975)
plt.show()
For some cities, there don't seem to be enough data points to ascertain the shape of the distribution. However, for the larger cities, where there are more data points, there is generally a slight positive skew; examples include London, Birmingham, Glasgow, and Leeds.
Below we consider the box plots on a city-by-city basis.
plt.clf()
fig, axes = plt.subplots(nrows = row_dim, ncols = ncol, figsize = (10, 6*row_dim))
axes = axes.flatten()
fig.suptitle("Weekly Rent Box Plot by City", fontweight = "bold")
for i, city in enumerate(cities):
    ax = axes[i]
    sns.boxplot(x = rental_data[rental_data["city"] == city]["weekly_rent"],
                ax = ax,
                boxprops = dict(edgecolor = (0, 0, 0), facecolor = (0, 0.13, 0.27), alpha = 0.4),
                whiskerprops = dict(color = (0, 0, 0)),
                medianprops = dict(color = (0, 0, 0)),
                )
    ax.set_title(city, fontweight = "semibold")
    ax.set_xlabel("Weekly Rent", fontweight = "medium")
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
for i in range(len(cities), len(axes)):
    axes[i].axis("off")
plt.tight_layout()
plt.subplots_adjust(top = 0.975)
plt.show()
Immediately, we notice a lot fewer obvious outliers, which lends credence to our theory concerning the different cities having different markets.
However, there are still some notable exceptions, such as on the Edinburgh, Cardiff, Coventry, Lincoln, and Sheffield plots.
We note that for the cities where there are few data points, the sporadic nature of the data can lead to some points being identified as outliers when they may in fact reflect a reasonable top of the market.
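As a quick sanity check on this point, one can count the entries per city to see which distributions are trustworthy; a minimal sketch on toy data (the counts here are illustrative):

```python
import pandas as pd

# Toy frame: per-city counts reveal which box plots rest on very few points
df = pd.DataFrame({"city": ["London"] * 5 + ["Lincoln"] * 2})

counts = df["city"].value_counts()
sparse = counts[counts < 3].index.tolist()  # cities with too few points to judge
print(counts)
print(sparse)
```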
Furthermore, just as considering the data set as a whole produced many apparent outliers because different cities have different rental landscapes, we may be seeing the same effect above: we are looking at all room types together, and the market for one-beds is distinct from that for non en-suites. Thus, we subdivide further and consider the box plots of each room type on a city-by-city basis.
cities = sorted(rental_data.city.unique())
room_types = rental_data.room_type.unique()
ncol = 4
row_dim = len(cities)
empty_axes = []
plt.clf()
fig, axes = plt.subplots(nrows = row_dim, ncols = ncol, figsize = (12, 4*row_dim))
axes = axes.flatten()
fig.suptitle("Weekly Rent Box Plot by City and Room Type", fontweight = "bold")
for i, city in enumerate(cities):
    for j, room_type in enumerate(room_types):
        ax_idx = len(room_types)*i + j
        ax = axes[ax_idx]
        data = rental_data.loc[
            (rental_data['room_type'] == room_type) & (rental_data['city'] == city), 'weekly_rent'
        ]
        if data.empty:
            empty_axes.append(ax_idx)
        sns.boxplot(x = data,
                    ax = ax,
                    boxprops = dict(edgecolor = (0, 0, 0), facecolor = (0, 0.13, 0.27), alpha = 0.4),
                    whiskerprops = dict(color = (0, 0, 0)),
                    medianprops = dict(color = (0, 0, 0)),
                    )
        ax.set_title(f"{city} - {room_type}", fontweight = "semibold")
        ax.set_xlabel("Weekly Rent", fontweight = "medium")
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
for i in empty_axes:
    axes[i].axis('off')
    axes[i].set_title('')
plt.tight_layout()
plt.subplots_adjust(top = 0.975)
plt.show()
Between this set of box plots and the box plots by city, we can identify which proposed outliers we think are true outliers: data points reflecting a typo or an error, which we will remove or correct through additional verification. To do this, we examine the outliers identified across the two sets of plots and use knowledge of reasonable market dynamics to separate the true outliers.
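This triage can also be assisted programmatically, by applying Tukey's IQR rule within each (city, room_type) market; a minimal sketch on toy data (values chosen to mimic the kinds of anomalies found below):

```python
import pandas as pd

# Toy data: each market has a cluster of plausible rents plus one extreme value
df = pd.DataFrame({
    "city": ["Coventry"] * 5 + ["Edinburgh"] * 5,
    "room_type": ["En-Suite"] * 10,
    "weekly_rent": [100, 110, 120, 130, 508, 205, 210, 215, 220, 868],
})

def iqr_outliers(group):
    # Tukey fences computed within each (city, room_type) market
    q1, q3 = group.quantile(0.25), group.quantile(0.75)
    fence = 1.5 * (q3 - q1)
    return group[(group < q1 - fence) | (group > q3 + fence)]

flagged = df.groupby(["city", "room_type"])["weekly_rent"].apply(iqr_outliers)
print(flagged)
```

Flagged points are candidates only; as discussed above, each still needs manual verification against market knowledge before removal or correction.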
Below we see that iQ Brighton's non en-suite is actually a two-bed apartment. We are unable to confirm whether the stated price is for the apartment as a whole (and should be split between two) or is per person. iQ appears to have many rooms like this, and given we cannot confirm the accuracy of the data point, we elect to remove it.
rental_data.loc[(rental_data['city'] == 'Brighton')]
| asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent | |
|---|---|---|---|---|---|---|---|---|
| 27 | 18 | Abacus | Brighton | iQ Student Accommodation | 351 | 2014 | En-Suite | 310.0 |
| 28 | 18 | Abacus | Brighton | iQ Student Accommodation | 351 | 2014 | Studio | 373.0 |
| 58 | 35 | Alumno Falmer | Brighton | HFS - Homes For Students | 71 | 2021 | En-Suite | 275.0 |
| 59 | 35 | Alumno Falmer | Brighton | HFS - Homes For Students | 71 | 2021 | Studio | 350.0 |
| 415 | 227 | Crown House | Brighton | CRM Students | 183 | 2022 | Studio | 320.0 |
| 629 | 342 | Hillfort House | Brighton | Student Roost | 378 | 2022 | En-Suite | 269.0 |
| 630 | 342 | Hillfort House | Brighton | Student Roost | 378 | 2022 | Studio | 339.0 |
| 633 | 344 | Holden Court | Brighton | CRM Students | 129 | 2022 | Studio | 330.0 |
| 634 | 345 | Hollingbury House | Brighton | Abodus Student Living | 195 | 2019 | One Bed | 350.0 |
| 635 | 345 | Hollingbury House | Brighton | Abodus Student Living | 195 | 2019 | Studio | 345.0 |
| 953 | 522 | Pavillon Point | Brighton | Fresh Student Living | 197 | 2021 | En-Suite | 270.0 |
| 954 | 522 | Pavillon Point | Brighton | Fresh Student Living | 197 | 2021 | Studio | 320.0 |
| 1005 | 552 | Promenade Student Living | Brighton | CRM Students | 156 | 2024 | En-Suite | 282.0 |
| 1006 | 552 | Promenade Student Living | Brighton | CRM Students | 156 | 2024 | Studio | 345.0 |
| 1037 | 569 | Ravilious House | Brighton | HFS - Prestige Student Living | 60 | 2023 | En-Suite | 272.5 |
| 1038 | 569 | Ravilious House | Brighton | HFS - Prestige Student Living | 60 | 2023 | Studio | 345.0 |
| 1230 | 675 | Stoneworks | Brighton | Aparto Student | 51 | 2017 | Studio | 310.0 |
| 1565 | 857 | Vogue Studios | Brighton | Aparto Student | 48 | 2017 | Studio | 310.0 |
| 1642 | 904 | iQ Brighton | Brighton | iQ Student Accommodation | 555 | 2020 | En-Suite | 286.0 |
| 1643 | 904 | iQ Brighton | Brighton | iQ Student Accommodation | 555 | 2020 | Non En-Suite | 301.5 |
| 1644 | 904 | iQ Brighton | Brighton | iQ Student Accommodation | 555 | 2020 | Studio | 361.0 |
| 1702 | 927 | iQ Sawmills | Brighton | iQ Student Accommodation | 51 | 2015 | En-Suite | 314.0 |
| 1703 | 927 | iQ Sawmills | Brighton | iQ Student Accommodation | 51 | 2015 | Studio | 346.0 |
rental_data.drop(
rental_data[(rental_data['asset'] == 'iQ Brighton') & (rental_data['room_type'] == 'Non En-Suite')].index,
inplace = True
)
The outlier for Bristol en-suites shows that an asset called King Square Studios has a large two-bed flat being identified as an en-suite. We are unable to confirm an accurate per-person, per-week price and so we remove it for simplicity.
rental_data.loc[(rental_data['city'] == 'Bristol') & (rental_data['asset'] == 'King Square Studios')]
| asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent | |
|---|---|---|---|---|---|---|---|---|
| 700 | 382 | King Square Studios | Bristol | Abodus Student Living | 301 | 2010 | En-Suite | 452.5 |
| 701 | 382 | King Square Studios | Bristol | Abodus Student Living | 301 | 2010 | One Bed | 450.0 |
| 702 | 382 | King Square Studios | Bristol | Abodus Student Living | 301 | 2010 | Studio | 375.0 |
rental_data.drop(
rental_data[(rental_data['asset'] == 'King Square Studios') & (rental_data['room_type'] == 'En-Suite')].index,
inplace = True
)
Below we see that Vita Student's Cannon Park asset is an outlier for en-suites in Coventry. This results from there being only two data points for en-suites in this asset, priced at $277$ and approximately $978$, with the median taking the halfway point. Given the $978$ is for a three-bed flat, we correct this to take just the value of the $277$ cluster, as advertised on their website.
rental_data.loc[(rental_data['city'] == 'Coventry') & (rental_data['room_type'] == 'En-Suite')]
| asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent | |
|---|---|---|---|---|---|---|---|---|
| 29 | 19 | Abbey House | Coventry | Mezzino | 115 | 2018 | En-Suite | 130.0 |
| 37 | 23 | Albany Student Village | Coventry | Mezzino | 436 | 2022 | En-Suite | 163.5 |
| 70 | 41 | Apollo Works | Coventry | Host Students | 407 | 2014 | En-Suite | 107.5 |
| 197 | 113 | Burges House | Coventry | Collegiate AC | 307 | 2015 | En-Suite | 195.0 |
| 212 | 121 | Calcott Ten | Coventry | HFS - Homes For Students | 737 | 2014 | En-Suite | 112.0 |
| 234 | 133 | Cannon Park | Coventry | Vita Student | 780 | 2022 | En-Suite | 508.0 |
| 239 | 136 | Canvas Coventry | Coventry | Canvas Student | 778 | 2020 | En-Suite | 149.0 |
| 352 | 192 | City Point | Coventry | Canvas Student | 385 | 2019 | En-Suite | 129.0 |
| 356 | 194 | City Village | Coventry | Downing Students | 600 | 2017 | En-Suite | 167.5 |
| 478 | 260 | Eden Square | Coventry | HFS - Prestige Student Living | 344 | 2020 | En-Suite | 154.0 |
| 595 | 323 | Gulson Gardens | Coventry | Fresh Student Living | 449 | 2020 | En-Suite | 145.0 |
| 606 | 329 | Harper Road | Coventry | Code Student Accommodation | 266 | 2020 | En-Suite | 134.0 |
| 667 | 363 | Infinity | Coventry | Novel Student | 505 | 2020 | En-Suite | 152.5 |
| 828 | 450 | Mercia Lodge | Coventry | CRM Students | 167 | 2015 | En-Suite | 131.0 |
| 833 | 453 | Merlin Point | Coventry | CRM Students | 155 | 2016 | En-Suite | 100.0 |
| 841 | 458 | Millennium View | Coventry | HFS - Homes For Students | 391 | 2017 | En-Suite | 114.0 |
| 935 | 513 | Paradise Student Village | Coventry | Axo Student Living | 1040 | 2019 | En-Suite | 99.5 |
| 974 | 533 | Pillar Box | Coventry | Collegiate AC | 129 | 2015 | En-Suite | 149.0 |
| 1024 | 563 | Queens Park House | Coventry | UNITE Students | 464 | 2002 | En-Suite | 104.0 |
| 1032 | 566 | Raglan House | Coventry | UNITE Students | 212 | 2007 | En-Suite | 107.0 |
| 1039 | 570 | Red Queen | Coventry | HFS - Universal Student Living | 210 | 2020 | En-Suite | 149.0 |
| 1133 | 620 | Sky Blue Point | Coventry | Host Students | 353 | 2004 | En-Suite | 107.0 |
| 1395 | 769 | The Oaks | Coventry | Student Roost | 961 | 2020 | En-Suite | 149.0 |
| 1442 | 791 | The Residence | Coventry | HFS - Prestige Student Living | 95 | 2019 | En-Suite | 160.0 |
| 1497 | 822 | Trinity View | Coventry | Prime Student Living | 614 | 2019 | En-Suite | 178.0 |
| 1576 | 864 | Weaver Place | Coventry | iQ Student Accommodation | 823 | 2020 | En-Suite | 177.0 |
| 1603 | 878 | Westwood Student Mews | Coventry | HFS - Homes For Students | 453 | 2019 | En-Suite | 150.0 |
rental_data.loc[(rental_data['city'] == 'Coventry')
& (rental_data['room_type'] == 'En-Suite')
& (rental_data['operator'] == 'Vita Student'),
'weekly_rent'
] = 277
We also note that the one-bed market has some questionable values in Coventry.
rental_data.loc[(rental_data['city'] == 'Coventry') & (rental_data['room_type'] == 'One Bed')]
| asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent | |
|---|---|---|---|---|---|---|---|---|
| 12 | 9 | 33 Parkside | Coventry | HFS - Prestige Student Living | 262 | 2018 | One Bed | 165.0 |
| 213 | 121 | Calcott Ten | Coventry | HFS - Homes For Students | 737 | 2014 | One Bed | 229.0 |
| 357 | 194 | City Village | Coventry | Downing Students | 600 | 2017 | One Bed | 280.0 |
| 668 | 363 | Infinity | Coventry | Novel Student | 505 | 2020 | One Bed | 445.0 |
| 1396 | 769 | The Oaks | Coventry | Student Roost | 961 | 2020 | One Bed | 299.0 |
| 1498 | 822 | Trinity View | Coventry | Prime Student Living | 614 | 2019 | One Bed | 286.0 |
We note that the asset at 33 Parkside is actually offering two-bed apartments at that price; we remove this entry as we are unable to ascertain the true per-person price. The Infinity asset also shows an outlier, but Novel Student is an incredibly premium operator and can command those prices given its service offering, so we leave the Infinity entry in place.
rental_data.drop(
rental_data[
(rental_data['asset'] == '33 Parkside') &
(rental_data['room_type'] == 'One Bed') &
(rental_data['city'] == 'Coventry')
].index,
inplace = True
)
We also note that Burges House has outlying values for its studios and en-suites which do not agree with one another, with the studio being an outlier in the market. For this reason, and given we have plenty of data points for Coventry anyway, we remove all Burges House entries.
rental_data.loc[(rental_data['city'] == 'Coventry') & (rental_data['room_type'].isin(['En-Suite', 'Studio']))]
| asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent | |
|---|---|---|---|---|---|---|---|---|
| 13 | 9 | 33 Parkside | Coventry | HFS - Prestige Student Living | 262 | 2018 | Studio | 217.0 |
| 29 | 19 | Abbey House | Coventry | Mezzino | 115 | 2018 | En-Suite | 130.0 |
| 30 | 19 | Abbey House | Coventry | Mezzino | 115 | 2018 | Studio | 185.0 |
| 37 | 23 | Albany Student Village | Coventry | Mezzino | 436 | 2022 | En-Suite | 163.5 |
| 38 | 23 | Albany Student Village | Coventry | Mezzino | 436 | 2022 | Studio | 202.5 |
| 70 | 41 | Apollo Works | Coventry | Host Students | 407 | 2014 | En-Suite | 107.5 |
| 71 | 41 | Apollo Works | Coventry | Host Students | 407 | 2014 | Studio | 165.5 |
| 197 | 113 | Burges House | Coventry | Collegiate AC | 307 | 2015 | En-Suite | 195.0 |
| 198 | 113 | Burges House | Coventry | Collegiate AC | 307 | 2015 | Studio | 99.0 |
| 212 | 121 | Calcott Ten | Coventry | HFS - Homes For Students | 737 | 2014 | En-Suite | 112.0 |
| 214 | 121 | Calcott Ten | Coventry | HFS - Homes For Students | 737 | 2014 | Studio | 195.0 |
| 234 | 133 | Cannon Park | Coventry | Vita Student | 780 | 2022 | En-Suite | 277.0 |
| 235 | 133 | Cannon Park | Coventry | Vita Student | 780 | 2022 | Studio | 366.5 |
| 239 | 136 | Canvas Coventry | Coventry | Canvas Student | 778 | 2020 | En-Suite | 149.0 |
| 240 | 136 | Canvas Coventry | Coventry | Canvas Student | 778 | 2020 | Studio | 174.0 |
| 352 | 192 | City Point | Coventry | Canvas Student | 385 | 2019 | En-Suite | 129.0 |
| 353 | 192 | City Point | Coventry | Canvas Student | 385 | 2019 | Studio | 164.0 |
| 356 | 194 | City Village | Coventry | Downing Students | 600 | 2017 | En-Suite | 167.5 |
| 358 | 194 | City Village | Coventry | Downing Students | 600 | 2017 | Studio | 230.0 |
| 396 | 217 | Copper Towers | Coventry | Vita Student | 496 | 2023 | Studio | 269.5 |
| 478 | 260 | Eden Square | Coventry | HFS - Prestige Student Living | 344 | 2020 | En-Suite | 154.0 |
| 479 | 260 | Eden Square | Coventry | HFS - Prestige Student Living | 344 | 2020 | Studio | 242.0 |
| 507 | 278 | Fairfax Street | Coventry | Code Student Accommodation | 1206 | 2018 | Studio | 215.0 |
| 595 | 323 | Gulson Gardens | Coventry | Fresh Student Living | 449 | 2020 | En-Suite | 145.0 |
| 597 | 323 | Gulson Gardens | Coventry | Fresh Student Living | 449 | 2020 | Studio | 194.5 |
| 606 | 329 | Harper Road | Coventry | Code Student Accommodation | 266 | 2020 | En-Suite | 134.0 |
| 667 | 363 | Infinity | Coventry | Novel Student | 505 | 2020 | En-Suite | 152.5 |
| 669 | 363 | Infinity | Coventry | Novel Student | 505 | 2020 | Studio | 250.0 |
| 683 | 371 | Julian Court | Coventry | Mezzino | 66 | 2018 | Studio | 172.5 |
| 828 | 450 | Mercia Lodge | Coventry | CRM Students | 167 | 2015 | En-Suite | 131.0 |
| 829 | 450 | Mercia Lodge | Coventry | CRM Students | 167 | 2015 | Studio | 140.0 |
| 833 | 453 | Merlin Point | Coventry | CRM Students | 155 | 2016 | En-Suite | 100.0 |
| 834 | 453 | Merlin Point | Coventry | CRM Students | 155 | 2016 | Studio | 192.5 |
| 841 | 458 | Millennium View | Coventry | HFS - Homes For Students | 391 | 2017 | En-Suite | 114.0 |
| 842 | 458 | Millennium View | Coventry | HFS - Homes For Students | 391 | 2017 | Studio | 189.0 |
| 935 | 513 | Paradise Student Village | Coventry | Axo Student Living | 1040 | 2019 | En-Suite | 99.5 |
| 936 | 513 | Paradise Student Village | Coventry | Axo Student Living | 1040 | 2019 | Studio | 160.0 |
| 974 | 533 | Pillar Box | Coventry | Collegiate AC | 129 | 2015 | En-Suite | 149.0 |
| 975 | 533 | Pillar Box | Coventry | Collegiate AC | 129 | 2015 | Studio | 159.0 |
| 1024 | 563 | Queens Park House | Coventry | UNITE Students | 464 | 2002 | En-Suite | 104.0 |
| 1025 | 563 | Queens Park House | Coventry | UNITE Students | 464 | 2002 | Studio | 170.0 |
| 1032 | 566 | Raglan House | Coventry | UNITE Students | 212 | 2007 | En-Suite | 107.0 |
| 1033 | 566 | Raglan House | Coventry | UNITE Students | 212 | 2007 | Studio | 186.0 |
| 1039 | 570 | Red Queen | Coventry | HFS - Universal Student Living | 210 | 2020 | En-Suite | 149.0 |
| 1124 | 614 | Sherbourne Student Village | Coventry | Axo Student Living | 209 | 2016 | Studio | 146.5 |
| 1133 | 620 | Sky Blue Point | Coventry | Host Students | 353 | 2004 | En-Suite | 107.0 |
| 1395 | 769 | The Oaks | Coventry | Student Roost | 961 | 2020 | En-Suite | 149.0 |
| 1397 | 769 | The Oaks | Coventry | Student Roost | 961 | 2020 | Studio | 279.0 |
| 1442 | 791 | The Residence | Coventry | HFS - Prestige Student Living | 95 | 2019 | En-Suite | 160.0 |
| 1443 | 791 | The Residence | Coventry | HFS - Prestige Student Living | 95 | 2019 | Studio | 199.0 |
| 1497 | 822 | Trinity View | Coventry | Prime Student Living | 614 | 2019 | En-Suite | 178.0 |
| 1499 | 822 | Trinity View | Coventry | Prime Student Living | 614 | 2019 | Studio | 243.0 |
| 1576 | 864 | Weaver Place | Coventry | iQ Student Accommodation | 823 | 2020 | En-Suite | 177.0 |
| 1578 | 864 | Weaver Place | Coventry | iQ Student Accommodation | 823 | 2020 | Studio | 224.5 |
| 1603 | 878 | Westwood Student Mews | Coventry | HFS - Homes For Students | 453 | 2019 | En-Suite | 150.0 |
rental_data.drop(
rental_data[
(rental_data['asset'] == 'Burges House') &
(rental_data['city'] == 'Coventry')
].index,
inplace = True
)
Again we see a Vita Student asset, this time in Edinburgh, containing a three-bed that is being marketed at $868$ per week for the entire apartment. We replace this with $\frac{868}{3} \approx 289.33$.
rental_data.loc[(rental_data['city'] == 'Edinburgh') & (rental_data['room_type'] == 'En-Suite')]
| asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent | |
|---|---|---|---|---|---|---|---|---|
| 83 | 49 | Arran House | Edinburgh | Yugo | 308 | 2015 | En-Suite | 205.5 |
| 193 | 111 | Buccleuch Street | Edinburgh | Hello Student | 86 | 2016 | En-Suite | 284.0 |
| 233 | 132 | Canal Point | Edinburgh | Yugo | 240 | 2014 | En-Suite | 211.0 |
| 482 | 262 | Edinburgh College Residences | Edinburgh | Allied Student Accommodation | 110 | 2010 | En-Suite | 204.0 |
| 551 | 301 | Gateway Apartments | Edinburgh | HFS - Prestige Student Living | 170 | 2018 | En-Suite | 217.5 |
| 566 | 309 | Goods Corner | Edinburgh | HFS - Prestige Student Living | 108 | 2018 | En-Suite | 252.0 |
| 568 | 310 | Gorgie | Edinburgh | Student Castle | 249 | 2022 | En-Suite | 212.0 |
| 598 | 324 | Haddington Place | Edinburgh | CRM Students | 240 | 2016 | En-Suite | 305.0 |
| 612 | 333 | Haymarket | Edinburgh | Abodus Student Living | 168 | 2014 | En-Suite | 220.0 |
| 672 | 365 | Iona Street | Edinburgh | Vita Student | 205 | 2023 | En-Suite | 868.0 |
| 816 | 442 | Mayfield Residences | Edinburgh | HFS - Prestige Student Living | 148 | 2022 | En-Suite | 220.0 |
| 818 | 443 | McDonald Road | Edinburgh | HFS - Prestige Student Living | 135 | 2008 | En-Suite | 209.0 |
| 856 | 467 | Murieston Crescent | Edinburgh | CRM Students | 120 | 2022 | En-Suite | 245.0 |
| 874 | 477 | New Park | Edinburgh | Downing Students | 238 | 2017 | En-Suite | 225.0 |
| 993 | 545 | Portsburgh Court | Edinburgh | Student Roost | 229 | 2006 | En-Suite | 314.0 |
| 1130 | 618 | Silk Mill | Edinburgh | Novel Student | 225 | 2021 | En-Suite | 274.0 |
| 1388 | 765 | The Mill House | Edinburgh | HFS - Homes For Students | 257 | 2017 | En-Suite | 212.0 |
| 1597 | 874 | Westfield | Edinburgh | Student Castle | 396 | 2022 | En-Suite | 214.0 |
| 1664 | 912 | iQ Elliott House | Edinburgh | iQ Student Accommodation | 138 | 2015 | En-Suite | 279.5 |
| 1669 | 914 | iQ Fountainbridge | Edinburgh | iQ Student Accommodation | 314 | 2017 | En-Suite | 263.0 |
| 1674 | 916 | iQ Grove | Edinburgh | iQ Student Accommodation | 325 | 2010 | En-Suite | 268.5 |
rental_data.loc[(rental_data['city'] == 'Edinburgh')
& (rental_data['room_type'] == 'En-Suite')
& (rental_data['operator'] == 'Vita Student'),
'weekly_rent'
] = 289.33
Riverside House in Guildford has dual-occupancy studios that are being classified as en-suites. We remove them as an inconsistency.
rental_data.loc[(rental_data['city'] == 'Guildford') & (rental_data['room_type'] == 'En-Suite')]
| asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent | |
|---|---|---|---|---|---|---|---|---|
| 126 | 72 | Bankside Student Village | Guildford | CRM Students | 346 | 2021 | En-Suite | 230.0 |
| 593 | 322 | Guilden Village | Guildford | Future Generation Asset Management Limited | 553 | 2021 | En-Suite | 209.0 |
| 1059 | 580 | Riverside House | Guildford | UniLife | 90 | 2022 | En-Suite | 435.0 |
| 1100 | 603 | Scape Guildford | Guildford | Scape Student Living | 544 | 2015 | En-Suite | 204.0 |
rental_data.drop(
rental_data[
(rental_data['asset'] == 'Riverside House') &
(rental_data['room_type'] == 'En-Suite') &
(rental_data['city'] == 'Guildford')
].index,
inplace = True
)
iQ Leeds in Leeds is another example of a mislabelled two-bed apartment that we cannot verify. We remove it.
rental_data.loc[(rental_data['city'] == 'Leeds') & (rental_data['room_type'] == 'Non En-Suite')]
| asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent | |
|---|---|---|---|---|---|---|---|---|
| 255 | 144 | Carr Mills Leeds | Leeds | N Joy Student Living | 298 | 2005 | Non En-Suite | 142.0 |
| 372 | 202 | Clarence Dock Village | Leeds | UNITE Students | 613 | 1994 | Non En-Suite | 116.0 |
| 659 | 358 | Hyde Park | Leeds | Almero Student Mansions | 72 | 2015 | Non En-Suite | 133.0 |
| 1690 | 922 | iQ Leeds | Leeds | iQ Student Accommodation | 634 | 2009 | Non En-Suite | 214.0 |
rental_data.drop(
rental_data[
(rental_data['asset'] == 'iQ Leeds') &
(rental_data['room_type'] == 'Non En-Suite') &
(rental_data['city'] == 'Leeds')
].index,
inplace = True
)
Similarly, the non en-suite at iQ's Pavilions asset in Lincoln is actually a two-bed apartment, and so we remove it for the same reason.
rental_data.loc[(rental_data['city'] == 'Lincoln')]
| asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent | |
|---|---|---|---|---|---|---|---|---|
| 168 | 96 | Brayford Quay | Lincoln | HFS - Homes For Students | 425 | 2005 | En-Suite | 126.0 |
| 169 | 96 | Brayford Quay | Lincoln | HFS - Homes For Students | 425 | 2005 | One Bed | 185.0 |
| 170 | 96 | Brayford Quay | Lincoln | HFS - Homes For Students | 425 | 2005 | Studio | 159.0 |
| 408 | 224 | Crosstrend House | Lincoln | Gather Students | 76 | 2012 | En-Suite | 160.0 |
| 409 | 224 | Crosstrend House | Lincoln | Gather Students | 76 | 2012 | Studio | 150.0 |
| 578 | 315 | Gravity - Lincoln | Lincoln | IconInc | 138 | 2019 | En-Suite | 165.0 |
| 579 | 315 | Gravity - Lincoln | Lincoln | IconInc | 138 | 2019 | One Bed | 353.5 |
| 580 | 315 | Gravity - Lincoln | Lincoln | IconInc | 138 | 2019 | Studio | 220.5 |
| 611 | 332 | Hayes Wharf House | Lincoln | iQ Student Accommodation | 222 | 2003 | En-Suite | 140.0 |
| 947 | 520 | Pavilions | Lincoln | iQ Student Accommodation | 1329 | 2006 | En-Suite | 136.0 |
| 948 | 520 | Pavilions | Lincoln | iQ Student Accommodation | 1329 | 2006 | Non En-Suite | 170.0 |
| 949 | 520 | Pavilions | Lincoln | iQ Student Accommodation | 1329 | 2006 | One Bed | 252.0 |
| 950 | 520 | Pavilions | Lincoln | iQ Student Accommodation | 1329 | 2006 | Studio | 182.5 |
| 976 | 534 | Pine Mill | Lincoln | Luna Students | 361 | 2021 | En-Suite | 150.5 |
| 1091 | 598 | Saul House | Lincoln | APS Property Group Ltd | 69 | 2014 | En-Suite | 137.0 |
| 1245 | 685 | Student Castle Lincoln | Lincoln | Student Castle | 116 | 2014 | En-Suite | 153.0 |
| 1246 | 685 | Student Castle Lincoln | Lincoln | Student Castle | 116 | 2014 | Studio | 191.5 |
| 1375 | 755 | The Junxion | Lincoln | Mezzino | 569 | 2004 | En-Suite | 128.5 |
rental_data.drop(
rental_data[
(rental_data['asset'] == 'Pavilions') &
(rental_data['room_type'] == 'Non En-Suite') &
(rental_data['city'] == 'Lincoln')
].index,
inplace = True
)
The YPP Gravity Residence asset in Liverpool contains a two-bed that is being marketed at $363.46$ per week for the entire apartment. We replace this with $\frac{363.46}{2} = 181.73$.
rental_data.loc[(rental_data['city'] == 'Liverpool') & (rental_data['room_type'] == 'En-Suite')]
| asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent | |
|---|---|---|---|---|---|---|---|---|
| 33 | 21 | Ablett House | Liverpool | Yugo | 396 | 2015 | En-Suite | 158.00 |
| 39 | 24 | Albert Court | Liverpool | Campus Living Villages | 516 | 2006 | En-Suite | 142.00 |
| 68 | 40 | Apollo Court | Liverpool | Student Roost | 221 | 2004 | En-Suite | 132.00 |
| 79 | 46 | Arndale House | Liverpool | X1 Lettings | 160 | 2012 | En-Suite | 140.00 |
| 82 | 48 | Arrad House | Liverpool | UNITE Students | 74 | 2000 | En-Suite | 119.00 |
| 85 | 50 | Art School Lofts | Liverpool | Hello Student | 64 | 2012 | En-Suite | 133.00 |
| 105 | 61 | Atlantic Point | Liverpool | UNITE Students | 928 | 1999 | En-Suite | 129.50 |
| 147 | 83 | Benson Yard | Liverpool | Fresh Student Living | 404 | 2023 | En-Suite | 179.00 |
| 202 | 116 | Byrom Point | Liverpool | Student Roost | 398 | 2016 | En-Suite | 170.00 |
| 220 | 124 | Calico | Liverpool | Fresh Student Living | 735 | 2019 | En-Suite | 171.00 |
| 228 | 128 | Cambridge Court | Liverpool | UNITE Students | 474 | 1999 | En-Suite | 119.50 |
| 247 | 140 | Capital Gate | Liverpool | Student Roost | 432 | 2004 | En-Suite | 148.00 |
| 278 | 155 | Cedar House | Liverpool | UNITE Students | 102 | 2001 | En-Suite | 165.00 |
| 340 | 185 | Chatham Lodge | Liverpool | Hello Student | 50 | 2010 | En-Suite | 135.00 |
| 397 | 218 | Copperas House | Liverpool | Urban Sleep | 280 | 2020 | En-Suite | 166.00 |
| 498 | 272 | Europa | Liverpool | Fresh Student Living | 592 | 2014 | En-Suite | 134.00 |
| 508 | 279 | Falkland House | Liverpool | Cloud Student Homes | 106 | 2015 | En-Suite | 129.50 |
| 573 | 313 | Grand Central | Liverpool | UNITE Students | 1236 | 2003 | En-Suite | 137.00 |
| 581 | 316 | Gravity Residence | Liverpool | YPP | 104 | 2018 | En-Suite | 363.46 |
| 604 | 328 | Hardman House | Liverpool | Urban Sleep | 350 | 2019 | En-Suite | 164.00 |
| 614 | 334 | Hayward House | Liverpool | Hello Student | 74 | 2013 | En-Suite | 131.50 |
| 641 | 348 | Hope Street Apartments | Liverpool | Host Students | 346 | 2015 | En-Suite | 154.00 |
| 643 | 349 | Horizon Heights | Liverpool | UNITE Students | 1085 | 2019 | En-Suite | 175.00 |
| 670 | 364 | Innovo House | Liverpool | CRM Students | 126 | 2021 | En-Suite | 145.00 |
| 740 | 401 | Lennon Studios | Liverpool | UNITE Students | 248 | 2001 | En-Suite | 126.50 |
| 805 | 435 | Maple House | Liverpool | Hello Student | 147 | 2012 | En-Suite | 136.00 |
| 852 | 464 | Moorfield | Liverpool | UNITE Students | 416 | 2000 | En-Suite | 136.00 |
| 859 | 469 | Myrtle Street - Apartments B | Liverpool | Urban Sleep | 260 | 2015 | En-Suite | 162.00 |
| 861 | 470 | Myrtle Street Apartments | Liverpool | Urban Sleep | 260 | 2015 | En-Suite | 162.00 |
| 912 | 499 | One Islington Plaza | Liverpool | HFS - Urban Student Life | 317 | 2019 | En-Suite | 152.50 |
| 931 | 510 | Paddington Park House | Liverpool | HFS - Homes For Students | 390 | 1999 | En-Suite | 137.00 |
| 963 | 527 | Phoenix Place | Liverpool | Propeller Lettings | 348 | 2018 | En-Suite | 137.00 |
| 1007 | 553 | Prospect Point | Liverpool | UNITE Students | 635 | 2003 | En-Suite | 131.00 |
| 1183 | 648 | St Lukes View | Liverpool | UNITE Students | 776 | 2017 | En-Suite | 170.00 |
| 1279 | 703 | The Arch | Liverpool | Downing Students | 261 | 2014 | En-Suite | 157.00 |
| 1297 | 713 | The Bridewell | Liverpool | Caro Student Living | 87 | 2015 | En-Suite | 135.00 |
| 1323 | 728 | The Edge | Liverpool | X1 Lettings | 231 | 2015 | En-Suite | 141.50 |
| 1379 | 759 | The Lantern | Liverpool | Fresh Student Living | 412 | 2018 | En-Suite | 172.00 |
| 1398 | 770 | The Octagon | Liverpool | Hello Student | 19 | 2013 | En-Suite | 157.00 |
| 1510 | 828 | True Student Liverpool | Liverpool | True Student | 999 | 2021 | En-Suite | 163.00 |
| 1531 | 839 | Unity Square | Liverpool | Mapleisle Ltd | 240 | 2019 | En-Suite | 147.50 |
| 1672 | 915 | iQ Great Newton House | Liverpool | iQ Student Accommodation | 294 | 2002 | En-Suite | 150.00 |
rental_data.loc[(rental_data['city'] == 'Liverpool')
& (rental_data['room_type'] == 'En-Suite')
& (rental_data['operator'] == 'YPP'),
'weekly_rent'
] = 181.73
We note that the Chapter non en-suites in London above $300$ per week are in fact two-bed and three-bed apartments, so we remove them as we cannot verify the per-person rental charge.
rental_data.loc[(rental_data['city'] == 'London')
& (rental_data['room_type'] == 'Non En-Suite')
& (rental_data['operator'] == 'Chapter London')
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 312 | 172 | Chapter Ealing | London | Chapter London | 424 | 2019 | Non En-Suite | 264.0 |
| 321 | 176 | Chapter Lewisham | London | Chapter London | 611 | 2016 | Non En-Suite | 284.0 |
| 324 | 177 | Chapter Old Street | London | Chapter London | 482 | 2015 | Non En-Suite | 416.5 |
| 328 | 178 | Chapter Portobello | London | Chapter London | 271 | 2011 | Non En-Suite | 329.5 |
| 331 | 179 | Chapter South Bank | London | Chapter London | 233 | 2010 | Non En-Suite | 516.5 |
| 333 | 180 | Chapter Spitalfields | London | Chapter London | 1117 | 2010 | Non En-Suite | 429.0 |
rental_data.drop(
rental_data[
(rental_data['operator'] == 'Chapter London') &
(rental_data['room_type'] == 'Non En-Suite') &
(rental_data['city'] == 'London') &
(rental_data['weekly_rent'] > 300)
].index,
inplace = True
)
We note that the Luna Hatfield asset is an outlier for studios. This is because it is located in Hatfield but classified as London. Whilst other areas around London, such as Egham, clearly benefit from proximity to the capital and their rental tones reflect this, Hatfield is further away and the asset's pricing is not in line with what we would expect. Furthermore, it is the only Hatfield asset in our data set, so we are reluctant to reclassify it to a new city. We therefore elect to remove it.
rental_data.drop(
rental_data[
(rental_data['asset'] == 'Luna Hatfield') &
(rental_data['city'] == 'London')
].index,
inplace = True
)
The en-suites at Apex Heights are actually two-bed en-suite clusters priced at $115$ per week per person. We revalue them as such.
rental_data.loc[(rental_data['city'] == 'Luton')]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 66 | 39 | Apex Heights | Luton | Apex Student Living | 40 | 2013 | En-Suite | 142.5 |
| 67 | 39 | Apex Heights | Luton | Apex Student Living | 40 | 2013 | Studio | 140.0 |
rental_data.loc[(rental_data['city'] == 'Luton')
& (rental_data['room_type'] == 'En-Suite')
& (rental_data['operator'] == 'Apex Student Living'),
'weekly_rent'
] = 115
Whilst it has not been classified as an outlier, there appears to be another iQ non en-suite in Manchester, at Kerria Apartments, priced far above the rest of the market. Further research reveals these are in fact two-beds, and we remove them for the same reasons as above.
rental_data.loc[(rental_data['city'] == 'Manchester') &
(rental_data['room_type'] == 'Non En-Suite')
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 176 | 100 | Bridgewater Heights | Manchester | UNITE Students | 529 | 2012 | Non En-Suite | 273.0 |
| 307 | 169 | Chancellors Court | Manchester | Mezzino | 190 | 1997 | Non En-Suite | 136.0 |
| 571 | 311 | Grafton Street | Manchester | Sanctuary Student | 590 | 2007 | Non En-Suite | 151.0 |
| 692 | 377 | Kerria Apartments | Manchester | iQ Student Accommodation | 350 | 2001 | Non En-Suite | 328.0 |
| 792 | 429 | Manchester Student Village | Manchester | Dwell Student Living | 1017 | 1997 | Non En-Suite | 168.0 |
| 873 | 476 | New Medlock House | Manchester | UNITE Students | 671 | 2001 | Non En-Suite | 195.0 |
| 1056 | 578 | River Street Tower | Manchester | Canvas Student | 792 | 2020 | Non En-Suite | 261.5 |
| 1361 | 747 | The Grafton | Manchester | Dwell Student Living | 145 | 2010 | Non En-Suite | 167.0 |
| 1559 | 853 | Victoria Point | Manchester | Hello Student | 561 | 2008 | Non En-Suite | 167.5 |
| 1602 | 877 | Weston Court | Manchester | Dwell Student Living | 140 | 1998 | Non En-Suite | 147.0 |
rental_data.drop(
rental_data[
(rental_data['asset'] == 'Kerria Apartments') &
(rental_data['city'] == 'Manchester') &
(rental_data['room_type'] == 'Non En-Suite')
].index,
inplace = True
)
Again, the Vita Student asset contains two-bed apartments being marketed at $405$ and $446$ per week for the entire apartment, recorded here as their average of $425.50$. We replace this with the per-person rate of $\frac{425.50}{2} = 212.75$.
rental_data.loc[(rental_data['city'] == 'Newcastle')
& (rental_data['room_type'] == 'En-Suite')
& (rental_data['weekly_rent'] > 400)
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 1599 | 875 | Westgate - Vita Student | Newcastle | Vita Student | 259 | 2016 | En-Suite | 425.5 |
rental_data.loc[(rental_data['city'] == 'Newcastle')
& (rental_data['room_type'] == 'En-Suite')
& (rental_data['operator'] == 'Vita Student'),
'weekly_rent'
] = 212.75
This asset in Nottingham contains a two-bed that is being marketed at $405$ per week for the entire apartment. We replace this with $\frac{405}{2} = 202.5$.
rental_data.loc[(rental_data['city'] == 'Nottingham')
& (rental_data['room_type'] == 'En-Suite')
& (rental_data['weekly_rent'] > 400)
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 252 | 143 | Carlton Building | Nottingham | Bee Hive - Harington Investments | 32 | 2022 | En-Suite | 405.0 |
rental_data.loc[(rental_data['city'] == 'Nottingham')
& (rental_data['room_type'] == 'En-Suite')
& (rental_data['asset'] == 'Carlton Building'),
'weekly_rent'
] = 202.5
We note that the single Oxford non en-suite is clearly mislabelled and is actually a one-bed flat. We relabel it below and then recalculate the median one-bed price at West Way Square, as that classification already exists there, before dropping the now-redundant row. With only two values, the median is simply their mean: $\frac{565 + 480}{2} = 522.50$.
rental_data.loc[(rental_data['city'] == 'Oxford')
& (rental_data.room_type.isin(['Non En-Suite', 'One Bed']))
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 1151 | 630 | Spireworks | Oxford | Aparto Student | 136 | 2021 | One Bed | 419.0 |
| 1591 | 871 | West Way Square | Oxford | HFS - Prestige Student Living | 261 | 2020 | Non En-Suite | 565.0 |
| 1592 | 871 | West Way Square | Oxford | HFS - Prestige Student Living | 261 | 2020 | One Bed | 480.0 |
rental_data.loc[(rental_data['city'] == 'Oxford')
& (rental_data['room_type'] == 'Non En-Suite'),
'room_type'
] = 'One Bed'
west_way_one_bed_median = rental_data[(rental_data['city'] == 'Oxford') &
(rental_data['asset'] == 'West Way Square') &
(rental_data['room_type'] == 'One Bed')
]['weekly_rent'].median()
rental_data.loc[(rental_data['city'] == 'Oxford')
& (rental_data['room_type'] == 'One Bed')
& (rental_data['asset'] == 'West Way Square'),
'weekly_rent'
] = west_way_one_bed_median
rental_data.drop(1591, inplace = True)
The Friargate Court asset in Preston is achieving far above the rest of the market to an unrealistic extent. It is also unclear whether it is solely student accommodation as it appears to be marketed as residential for working people too. Therefore, we remove it to maintain consistency.
rental_data.loc[(rental_data['city'] == 'Preston')
& (rental_data['room_type'].isin(['En-Suite', 'Studio']))
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 236 | 134 | Canterbury Hall | Preston | Nurtur Student Living | 191 | 2018 | Studio | 164.500 |
| 529 | 289 | Foundry Court | Preston | HFS - Homes For Students | 438 | 2003 | En-Suite | 110.500 |
| 541 | 296 | Friargate Court | Preston | Portergate Property Management | 244 | 2016 | En-Suite | 167.500 |
| 542 | 296 | Friargate Court | Preston | Portergate Property Management | 244 | 2016 | Studio | 275.000 |
| 681 | 370 | Jubilee Court | Preston | Cloud Student Homes | 246 | 2016 | En-Suite | 107.500 |
| 682 | 370 | Jubilee Court | Preston | Cloud Student Homes | 246 | 2016 | Studio | 165.000 |
| 738 | 400 | Leighton Hall | Preston | HFS - Urban Student Life | 298 | 2005 | En-Suite | 95.000 |
| 739 | 400 | Leighton Hall | Preston | HFS - Urban Student Life | 298 | 2005 | Studio | 165.000 |
| 850 | 463 | Moor Lane Halls | Preston | Sanctuary Student | 498 | 2008 | En-Suite | 89.000 |
| 1367 | 750 | The Guild Tavern | Preston | Metro Student Accommodation | 40 | 2013 | En-Suite | 108.000 |
| 1368 | 750 | The Guild Tavern | Preston | Metro Student Accommodation | 40 | 2013 | Studio | 147.000 |
| 1373 | 754 | The Jazz Bar | Preston | Metro Student Accommodation | 40 | 2013 | En-Suite | 103.000 |
| 1374 | 754 | The Jazz Bar | Preston | Metro Student Accommodation | 40 | 2013 | Studio | 178.000 |
| 1464 | 803 | The Tramshed | Preston | HFS - Prestige Student Living | 316 | 2017 | En-Suite | 122.500 |
| 1465 | 803 | The Tramshed | Preston | HFS - Prestige Student Living | 316 | 2017 | Studio | 185.000 |
| 1496 | 821 | Trinity Student Village | Preston | HFS - Homes For Students | 424 | 2002 | En-Suite | 121.000 |
| 1534 | 841 | Urban Hub | Preston | HFS - Prestige Student Living | 425 | 2022 | Studio | 185.000 |
| 1568 | 859 | Walker Street | Preston | Sanctuary Student | 175 | 2012 | En-Suite | 105.000 |
| 1569 | 859 | Walker Street | Preston | Sanctuary Student | 175 | 2012 | Studio | 151.750 |
| 1570 | 860 | Warehouse Apartments | Preston | Warehouse Students Ltd | 234 | 2004 | En-Suite | 95.410 |
| 1571 | 860 | Warehouse Apartments | Preston | Warehouse Students Ltd | 234 | 2004 | Studio | 134.995 |
| 1687 | 921 | iQ Kopa | Preston | iQ Student Accommodation | 849 | 2008 | En-Suite | 121.000 |
| 1688 | 921 | iQ Kopa | Preston | iQ Student Accommodation | 849 | 2008 | Studio | 168.000 |
rental_data.drop(
rental_data[
(rental_data['asset'] == 'Friargate Court') &
(rental_data['room_type'].isin(['En-Suite', 'Studio'])) &
(rental_data['city'] == 'Preston')
].index,
inplace = True
)
The box plots suggest Sheffield has a few outliers. We start by examining the studios above $230$ per week to better understand the top of the market, noting that the three outliers flagged by the box plot are all above $250$.
rental_data.loc[(rental_data['city'] == 'Sheffield')
& (rental_data['room_type'] == 'Studio')
& (rental_data['weekly_rent'] > 230)
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 122 | 69 | Bailey Fields | Sheffield | Now Students | 543 | 2018 | Studio | 234.0 |
| 417 | 226 | Crown House | Sheffield | HFS - Prestige Student Living | 355 | 2017 | Studio | 255.0 |
| 632 | 343 | Hillside House | Sheffield | Novel Student | 250 | 2021 | Studio | 295.0 |
| 721 | 390 | Knight House | Sheffield | iQ Student Accommodation | 257 | 2019 | Studio | 238.0 |
| 1220 | 669 | Steelworks | Sheffield | HFS - Prestige Student Living | 691 | 2021 | Studio | 265.0 |
| 1276 | 701 | Telephone House - Vita Student | Sheffield | Vita Student | 366 | 2015 | Studio | 235.5 |
The assets achieving a premium are not unreasonably far above the rest of the market, and we note that Hillside House, which achieves the highest rents in the market, is operated by Novel Student, who are known as a premium PBSA operator.
Again, the Vita Student asset contains a two-bed and a three-bed that are being marketed at the price for the entire apartment. We replace this with $174.17$, which is the true median.
rental_data.loc[(rental_data['city'] == 'Sheffield')
& (rental_data['room_type'] == 'En-Suite')
& (rental_data['weekly_rent'] > 400)
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 1275 | 701 | Telephone House - Vita Student | Sheffield | Vita Student | 366 | 2015 | En-Suite | 429.5 |
rental_data.loc[(rental_data['city'] == 'Sheffield')
& (rental_data['room_type'] == 'En-Suite')
& (rental_data['operator'] == 'Vita Student'),
'weekly_rent'
] = 174.17
Similarly, Sovereign Newbank House is being marketed at the prices for its entire two- and three-bedroom flats. We correct it to the true median.
rental_data.loc[(rental_data['city'] == 'Sheffield')
& (rental_data['room_type'] == 'Non En-Suite')
& (rental_data['weekly_rent'] > 250)
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 1146 | 628 | Sovereign Newbank House | Sheffield | Xenia Students | 236 | 2015 | Non En-Suite | 293.75 |
rental_data.loc[(rental_data['city'] == 'Sheffield')
& (rental_data['room_type'] == 'Non En-Suite')
& (rental_data['operator'] == 'Xenia Students'),
'weekly_rent'
] = 143.75
Given the established price of studios, the price of one-beds at the upper end seems reasonable within the context of the Sheffield market, but we note an extremely cheap one-bed at iQ Steel. Further examination shows this is a dual-occupancy room and the price we have recorded is per person. We double it to reflect the true room price.
rental_data.loc[(rental_data['city'] == 'Sheffield')
& (rental_data['room_type'] == 'One Bed')
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 121 | 69 | Bailey Fields | Sheffield | Now Students | 543 | 2018 | One Bed | 319.00 |
| 206 | 118 | COSMOS Sheffield | Sheffield | CRM Students | 860 | 2021 | One Bed | 239.00 |
| 287 | 159 | Central Place (Formerly Sheffield 2) | Sheffield | Student Roost | 389 | 2003 | One Bed | 226.00 |
| 414 | 226 | Crown House | Sheffield | HFS - Prestige Student Living | 355 | 2017 | One Bed | 330.00 |
| 637 | 346 | Hollis Croft | Sheffield | Student Roost | 972 | 2019 | One Bed | 277.00 |
| 656 | 356 | Huttons Buildings | Sheffield | City Estates | 164 | 2016 | One Bed | 204.00 |
| 986 | 540 | Porterbrook Apartments | Sheffield | City Estates | 105 | 2016 | One Bed | 222.50 |
| 1009 | 554 | Provincial House | Sheffield | Hello Student | 107 | 2017 | One Bed | 265.00 |
| 1046 | 573 | Redvers Tower | Sheffield | HFS - Homes For Students | 170 | 2016 | One Bed | 240.00 |
| 1120 | 612 | Sharman Court | Sheffield | Student Castle | 397 | 2016 | One Bed | 248.50 |
| 1147 | 628 | Sovereign Newbank House | Sheffield | Xenia Students | 236 | 2015 | One Bed | 241.25 |
| 1217 | 668 | Steel City | Sheffield | Future Generation Asset Management Limited | 324 | 2019 | One Bed | 253.50 |
| 1501 | 823 | Trippet Lane | Sheffield | Hello Student | 63 | 2017 | One Bed | 228.00 |
| 1707 | 929 | iQ Steel | Sheffield | iQ Student Accommodation | 187 | 2009 | One Bed | 154.00 |
rental_data.loc[(rental_data['city'] == 'Sheffield')
& (rental_data['room_type'] == 'One Bed')
& (rental_data['asset'] == 'iQ Steel'),
'weekly_rent'
] = 308
This Stanley Studios non en-suite is effectively a high-end two-bed flat, and so is inconsistent with what we want the non en-suite classification to capture. For that reason, we remove it.
rental_data.loc[(rental_data['city'] == 'Southampton')
& (rental_data['room_type'] == 'Non En-Suite')
& (rental_data['weekly_rent'] > 200)
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 1209 | 663 | Stanley Studios | Southampton | HFS - Prestige Student Living | 204 | 2021 | Non En-Suite | 215.0 |
rental_data.drop(
rental_data[
(rental_data['asset'] == 'Stanley Studios') &
(rental_data['room_type'] == 'Non En-Suite') &
(rental_data['city'] == 'Southampton')
].index,
inplace = True
)
These en-suite and non en-suite rooms in Dunn House are actually dual occupancy studios and so we remove them.
rental_data.loc[(rental_data['city'] == 'Sunderland')
& (rental_data['asset'] == 'Dunn House')
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 465 | 253 | Dunn House | Sunderland | Cloud Student Homes | 110 | 2012 | En-Suite | 205.0 |
| 466 | 253 | Dunn House | Sunderland | Cloud Student Homes | 110 | 2012 | Non En-Suite | 220.0 |
| 467 | 253 | Dunn House | Sunderland | Cloud Student Homes | 110 | 2012 | Studio | 130.0 |
rental_data.drop(
rental_data[
(rental_data['asset'] == 'Dunn House') &
(rental_data['room_type'].isin(['Non En-Suite', 'En-Suite'])) &
(rental_data['city'] == 'Sunderland')
].index,
inplace = True
)
iQ Fiveways House in Wolverhampton has another difficult iQ two-bed apartment. Again, we cannot ascertain whether the price listed is for the entire apartment or per person. We remove it for consistency.
rental_data.loc[(rental_data['city'] == 'Wolverhampton')
& (rental_data['room_type'] == 'Non En-Suite')
]
| | asset_id | asset | city | operator | beds | build_date | room_type | weekly_rent |
|---|---|---|---|---|---|---|---|---|
| 1668 | 913 | iQ Fiveways House | Wolverhampton | iQ Student Accommodation | 296 | 2003 | Non En-Suite | 150.0 |
rental_data.drop(
rental_data[
(rental_data['asset'] == 'iQ Fiveways House') &
(rental_data['room_type'] == 'Non En-Suite') &
(rental_data['city'] == 'Wolverhampton')
].index,
inplace = True
)
We now examine the weekly_rent in comparison to the categorical variables, starting with the distribution of weekly_rent by room_type.
room_type_sorted = sorted(rental_data.room_type.unique())
ncol = 2
row_dim = int(np.ceil(len(room_type_sorted)/ncol))
plt.clf()
fig, axes = plt.subplots(nrows = row_dim, ncols = ncol, figsize = (10, 6*row_dim))
axes = axes.flatten()
fig.suptitle("Weekly Rent Distribution by Room Type", fontweight = "bold")
for i, room_type in enumerate(room_type_sorted):
    ax = axes[i]
    sns.histplot(rental_data[rental_data["room_type"] == room_type]["weekly_rent"], bins = 12, ax = ax, color = (0, 0.13, 0.27))
    ax.set_title(f"{room_type}", fontweight = "semibold")
    ax.set_xlabel("Weekly Rent", fontweight = "medium")
    ax.set_ylabel("Count", fontweight = "medium")
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
for i in range(len(room_type_sorted), len(axes)):
    axes[i].axis("off")
plt.tight_layout()
plt.subplots_adjust(top = 0.94)
plt.show()
As we can see, each room type exhibits the same positive skew we saw in the wider data set. We now plot the box plots of weekly_rent by room_type.
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
order = ['Non En-Suite', 'En-Suite', 'Studio', 'One Bed']
sns.boxplot(x = rental_data["room_type"],
y = rental_data["weekly_rent"],
boxprops = dict(edgecolor = (0, 0, 0), facecolor = (0, 0.13, 0.27), alpha = 0.4),
whiskerprops = dict(color = (0, 0, 0)),
medianprops = dict(color = (0, 0, 0)),
order = order
)
ax.set_title("Box Plot of the Weekly Rent by Room Type", fontweight = "semibold")
ax.set_xlabel("Room Type", fontweight = "medium")
ax.set_ylabel("Weekly Rent", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
As expected, we see that one-beds have the highest median, followed by studios, en-suites, and then non en-suites. As with the plots of the larger data set, the outliers arise from the different regional sub-markets of individual cities.
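The sub-market point can be illustrated with Tukey's $1.5 \times \text{IQR}$ rule, the same fences seaborn uses for box-plot whiskers: pooled across cities, a small expensive sub-market breaches the fences, yet within each city nothing does. A minimal sketch on invented data (the cities and rents below are illustrative only):

```python
import pandas as pd

def iqr_outliers(s):
    # Tukey's rule: flag values beyond 1.5 * IQR from the quartiles
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Toy rents: a large cheap sub-market plus a small expensive one
toy = pd.DataFrame({
    'city': ['Leeds'] * 10 + ['London'] * 3,
    'weekly_rent': [100, 105, 110, 115, 120, 125, 130, 135, 140, 145,
                    300, 320, 340],
})
pooled = iqr_outliers(toy['weekly_rent']).sum()
within = toy.groupby('city')['weekly_rent'].apply(iqr_outliers).sum()
print(pooled, within)  # pooled flags the expensive rents; per-city flags none
```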
We now examine the frequency of the different build dates.
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.countplot(data = rental_data, x = rental_data["build_date"], color = (0, 0.13, 0.27))
ax.set_title("Frequency of Build Date", fontweight = "semibold")
ax.set_xlabel("Build Date", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
plt.xticks(rotation = 45)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
We note that the build dates are indeed negatively skewed, as first suspected from the summary statistics. Again, this makes sense given the greater exposure investors have sought to PBSA of late, with developers incentivised to build PBSA for a 'hot' market and the strengthening fundamentals of rising UK student numbers supporting this decision in recent years.
It appears that the 'boom' in PBSA development occurred between 2014 and 2020, with the number of private PBSA assets in our data set opening in the last few years decreasing. Perhaps this is a result of student growth slowing and untapped demand shrinking following the explosion in development, leading developers to look to other asset classes. Alternatively, the slowdown could be a response to the stricter building regulations introduced after the Grenfell disaster.
Irrespective of the underlying causes, the data lies within the expectations we would have for the distribution of the build dates.
We now consider the distribution of the bed numbers.
unique_assets_df = rental_data.drop_duplicates(subset = ["asset_id"])
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.histplot(unique_assets_df.beds, color = (0, 0.13, 0.27), bins = 50, kde = True)
ax.set_title("Distribution of Bed Numbers", fontweight = "semibold")
ax.set_xlabel("Beds", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_xlim(left = 0)
plt.show()
Here we can see a clear positive skew, with most assets being between $0$ and $300$ beds in size. This data will likely need to be transformed before modelling.
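One way to check whether a transformation is warranted is to compare the sample skewness before and after taking logs; a value nearer zero after the transform supports using it. A minimal sketch on invented bed counts (not drawn from the data set):

```python
import numpy as np
import pandas as pd

# Invented bed counts with a long right tail, mimicking the histogram above
beds = pd.Series([40, 60, 80, 100, 120, 150, 180, 220, 300, 450, 800, 1200])
raw_skew = beds.skew()
log_skew = np.log(beds).skew()
print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")
```

The log transform pulls the long tail in, bringing the skewness much closer to zero.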
We now examine the frequency of different categorical factors in order to further understand the data. We do so by examining the number of beds by category.
beds_by_city = unique_assets_df.groupby('city')['beds'].sum()
beds_by_city = beds_by_city.sort_values(ascending = False)
beds_by_city_df = beds_by_city.reset_index()
beds_by_city_df.columns = ['city', 'total_beds']
plt.clf()
fig, ax = plt.subplots(figsize = (20, 6))
sns.barplot(data = beds_by_city_df,
y = beds_by_city_df["total_beds"],
x = beds_by_city_df['city'],
color = (0, 0.13, 0.27)
)
ax.set_title("Bed Numbers by City", fontweight = "semibold")
ax.set_xlabel("City", fontweight = "medium")
ax.set_ylabel("Beds", fontweight = "medium")
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:,.0f}'))
plt.xticks(rotation = 90)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
As expected, and as the summary statistics revealed, London has the most PBSA beds in the UK, with other large cities not far behind. We note that London has almost double the number of PBSA beds of the next city, Sheffield, which highlights how big the London PBSA market is.
beds_by_operator = unique_assets_df.groupby('operator')['beds'].sum()
beds_by_operator = beds_by_operator.sort_values(ascending = False)
beds_by_operator_df = beds_by_operator.reset_index()
beds_by_operator_df.columns = ['operator', 'total_beds']
plt.clf()
fig, ax = plt.subplots(figsize = (20, 6))
sns.barplot(data = beds_by_operator_df,
y = beds_by_operator_df["total_beds"],
x = beds_by_operator_df['operator'],
color = (0, 0.13, 0.27)
)
ax.set_title("Bed Numbers by Operator", fontweight = "semibold")
ax.set_xlabel("Operator", fontweight = "medium")
ax.set_ylabel("Beds", fontweight = "medium")
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:,.0f}'))
plt.xticks(rotation = 90)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Here we see that UNITE Students is by far the largest operator of PBSA in the UK, with over $40,000$ beds according to our data set. We note that there are a few very large operators, such as UNITE, iQ, CRM, Student Roost, and the various Homes For Students brands, followed by a long tail of very small operators, who likely run a handful of assets on a local, rather than national, scale.
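The concentration described above can be quantified by the share of beds held by the largest operators. A minimal sketch on invented operator sizes (the figures below are illustrative, not taken from the data set):

```python
import pandas as pd

# Invented bed counts: five national operators plus a long tail of locals
beds_by_op = pd.Series([40000, 30000, 20000, 15000, 12000] + [500] * 40)
top5_share = beds_by_op.nlargest(5).sum() / beds_by_op.sum()
print(f"top-5 share of beds: {top5_share:.0%}")  # a heavily concentrated market
```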
We further consider all operators with more than $5,000$ beds to get a clearer picture of the largest operators in the market.
beds_by_operator_df_top = beds_by_operator_df[beds_by_operator_df['total_beds'] > 5000]
plt.clf()
fig, ax = plt.subplots(figsize = (20, 6))
sns.barplot(data = beds_by_operator_df_top,
y = beds_by_operator_df_top["total_beds"],
x = beds_by_operator_df_top['operator'],
color = (0, 0.13, 0.27)
)
ax.set_title("Bed Numbers by Largest Operators", fontweight = "semibold")
ax.set_xlabel("Operator", fontweight = "medium")
ax.set_ylabel("Beds", fontweight = "medium")
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:,.0f}'))
plt.xticks(rotation = 45)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
We note that UNITE Students provide a more basic offering as an operator and have amassed scale at reasonable rents relative to the surrounding sub-markets. In contrast, operators such as Vita Student have built sizeable portfolios offering some of the best service in the market, and are therefore likely to achieve rental premiums over the surrounding markets as a result.
We now have a clearer picture of some of the key features of the data and have ensured accuracy by removing or correcting the offending outliers. We note that from a quantitative perspective, both weekly_rent and beds are positively skewed and likely to require transformations. On the other hand, build_date has a negative skew, which may also need transformation.
We have also considered the different sub-market dynamics by examining the distribution of weekly_rent by city and ascertaining that where there is scale, there is also a positively skewed distribution. Finally, we have considered some of the categorical variables to reveal that the city with the most beds is London and that UNITE Students operate the largest PBSA portfolio in the UK.
Given the first model we wish to attempt to fit to our data is a Linear Regression model, in the next section we consider the suitability of our data set for such an application.
In this section, we will verify that a Linear Regression is indeed a suitable model to apply to this data. We will do so by examining the relationship between the different variables and weekly_rent to see if there is indeed an underlying linear relationship.
We start by examining the relationship between weekly_rent and the other numerical data.
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
line_colour = (0.95, 0.65, 0.07)
sns.regplot(x = rental_data["build_date"],
y = rental_data["weekly_rent"],
color = (0, 0.13, 0.27),
line_kws = {'color': line_colour}
)
ax.set_title("Weekly Rent by Build Date", fontweight = "semibold")
ax.set_xlabel("Build Date", fontweight = "medium")
ax.set_ylabel("Weekly Rent", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Above we can see that there does indeed appear to be a slight positive relationship between build_date and weekly_rent. This suggests that newer buildings tend to achieve higher rents, which makes logical sense if we consider that newer buildings are more likely to have a more modern specification, increased amenity space, and greater kerb appeal.
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
line_colour = (0.95, 0.65, 0.07)
sns.regplot(x = rental_data["beds"],
y = rental_data["weekly_rent"],
color = (0, 0.13, 0.27),
line_kws = {'color': line_colour}
)
ax.set_title("Weekly Rent by Beds", fontweight = "semibold")
ax.set_xlabel("Beds", fontweight = "medium")
ax.set_ylabel("Weekly Rent", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.gca().xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:,.0f}'))
plt.show()
Here we see that the number of beds has a very slight positive correlation with weekly_rent. However, the correlation is so small as to be effectively zero, suggesting that the size of an asset has little bearing on the rent it commands.
Whilst some may argue that a larger asset is more likely to have more amenity space, making a scheme more attractive, there are other dynamics at play. Firstly, this is not always the case. Secondly, even where there is more amenity space, it is shared between a larger group of people, reducing its attractiveness. Thirdly, larger schemes are more likely to have a wider array of rooms with a bigger mix of clusters and studios, and thus a wider range of weekly rents, with cheaper rooms offsetting any premium from the additional amenity space. This results in the near-absent correlation we see above.
Given this lack of relationship, we may consider dropping the variable from our model as it appears to contribute little. Before making any decisions, we further consider whether there is a relationship between the logarithm of beds or, alternatively, if grouping the assets by bed numbers shows some sort of ordinal relationship.
log_beds = np.log(rental_data.beds)
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
line_colour = (0.95, 0.65, 0.07)
sns.regplot(x = log_beds,
y = rental_data["weekly_rent"],
color = (0, 0.13, 0.27),
line_kws = {'color': line_colour}
)
ax.set_title("Weekly Rent by Log-Beds", fontweight = "semibold")
ax.set_xlabel("Log-Beds", fontweight = "medium")
ax.set_ylabel("Weekly Rent", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.gca().xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:,.0f}'))
plt.show()
Here we see there is still very little correlation, further supporting dropping the variable if models do not deem it significant.
def classify_beds(beds):
    if beds < 100:
        return "sub-scale"
    elif 100 <= beds < 400:
        return "normal"
    elif 400 <= beds < 800:
        return "large"
    else:
        return "oversized"
rental_data['bed_class'] = rental_data.beds.apply(classify_beds)
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
order = ['sub-scale', 'normal', 'large', 'oversized']
sns.boxplot(x = rental_data["bed_class"],
y = rental_data["weekly_rent"],
boxprops = dict(edgecolor = (0, 0, 0), facecolor = (0, 0.13, 0.27), alpha = 0.4),
whiskerprops = dict(color = (0, 0, 0)),
medianprops = dict(color = (0, 0, 0)),
order = order
)
ax.set_title("Box Plot of the Weekly Rent by Asset Size", fontweight = "semibold")
ax.set_xlabel("Asset Size", fontweight = "medium")
ax.set_ylabel("Weekly Rent", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Above we have classified the assets into 'sub-scale', 'normal', 'large' or 'oversized' categories based upon the number of beds. Still we see very little relationship, if any, adding further credence to dropping the variable. We shall keep the variable in for now and assess its impact on the model once we are evaluating it, perhaps removing it at a later date.
Below we create a correlation matrix between the numerical variables to assess the strength of the linear relationships.
quant_rental_data = rental_data[['beds', 'build_date', 'weekly_rent']]
corr_matrix = quant_rental_data.corr()
from matplotlib.colors import LinearSegmentedColormap
colours = [(1, 1, 1), (0, 0.13, 0.27)]
custom_palette = LinearSegmentedColormap.from_list('custom_pale_to_blue', colours)
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.heatmap(corr_matrix, annot = True, cmap = custom_palette, fmt = '.2f')
ax.set_title('Correlation Heatmap of Numerical Variables', fontweight = 'semibold')
plt.show()
As we can see, there is indeed a weak positive correlation between build_date and weekly_rent of $0.23$, suggesting that newer buildings charge a premium.
Furthermore, the correlation of $0.07$ between beds and weekly_rent indicates no meaningful linear relationship, so beds may contribute little to the model.
We now turn our attention to the categorical features - namely room_type, operator, and city.
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
room_type_order = ['Non En-Suite', 'En-Suite', 'Studio', 'One Bed']
sns.boxplot(x = rental_data["room_type"],
y = rental_data["weekly_rent"],
boxprops = dict(edgecolor = (0, 0, 0), facecolor = (0, 0.13, 0.27), alpha = 0.4),
whiskerprops = dict(color = (0, 0, 0)),
medianprops = dict(color = (0, 0, 0)),
order = room_type_order
)
ax.set_title("Box Plot of the Weekly Rent by Room Type", fontweight = "semibold")
ax.set_xlabel("Room Type", fontweight = "medium")
ax.set_ylabel("Weekly Rent", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Here we see that the categories show a clear separation from one another, especially in the case of one-beds, studios, and en-suites. We note that non en-suites and en-suites have fairly similar profiles; however, the latter still has slightly higher quartiles.
top_operators = beds_by_operator.head(20).index.tolist()
rental_data_top_operators = rental_data[rental_data.operator.isin(top_operators)]
plt.clf()
fig, ax = plt.subplots(figsize = (14, 6))
sns.boxplot(x = rental_data_top_operators["operator"],
y = rental_data_top_operators["weekly_rent"],
boxprops = dict(edgecolor = (0, 0, 0), facecolor = (0, 0.13, 0.27), alpha = 0.4),
whiskerprops = dict(color = (0, 0, 0)),
medianprops = dict(color = (0, 0, 0)),
order = top_operators
)
ax.set_title("Box Plot of the Weekly Rent By Top Operators", fontweight = "semibold")
ax.set_xlabel("Top Operators by Beds", fontweight = "medium")
ax.set_ylabel("Weekly Rent", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xticks(rotation = 90)
plt.show()
Above we have plotted box plots for the weekly_rent by operator for the $20$ largest operators in the UK. We have selected only these operators as plotting all $100+$ operators would be impractical to draw conclusions from, and many of the smaller operators only manage a few assets, meaning the data is low volume and therefore less insightful. Furthermore, by selecting the $20$ largest operators, we are accounting for the data that is most common in our data set and therefore most likely to be influential.
We note that, even amongst our subset of operators, we can see differences in the ranges and medians achieved. Some of these differences will be explainable by the sub-markets these operators are active in. For example, Chapter London is a brand that exclusively manages assets in London, which explains why their achieved rental tone is so consistently above other operators. UNITE and iQ, on the other hand, operate thousands of beds on a national scale which leads to a larger range at a lower price point on average.
Despite this, and considering that the majority of these operators have assets across the UK, which somewhat mitigates the influence of sub-market dynamics, we can see that there are distinctions in rental tone amongst operators. For example, UNITE and Homes For Students typically provide quite a basic level of service, and this is reflected in their lower median and minimum weekly_rents. On the other hand, the aforementioned Chapter London, Scape Student Living, and Vita Student are known for providing a premium service at the top end of the market, which we also see reflected here. This distinction between operators lends credence to the application of a Linear Regression model.
top_cities = beds_by_city.head(30).index.tolist()
rental_data_top_cities = rental_data[rental_data.city.isin(top_cities)]
plt.clf()
fig, ax = plt.subplots(figsize = (14, 6))
sns.boxplot(x = rental_data_top_cities["city"],
y = rental_data_top_cities["weekly_rent"],
boxprops = dict(edgecolor = (0, 0, 0), facecolor = (0, 0.13, 0.27), alpha = 0.4),
whiskerprops = dict(color = (0, 0, 0)),
medianprops = dict(color = (0, 0, 0)),
order = top_cities
)
ax.set_title("Box Plot of the Weekly Rent By Most-Supplied Cities", fontweight = "semibold")
ax.set_xlabel("Top Cities by Beds", fontweight = "medium")
ax.set_ylabel("Weekly Rent", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xticks(rotation = 90)
plt.show()
Similarly to operators, there are too many cities to consider all at once. However, we have looked at the $30$ most supplied cities by beds. Again, this provides the most data points, avoids being too congested to draw conclusions, and benefits from providing insight into the most influential data.
Once again, we see clear regional differences supporting that this variable could be meaningful for a Linear Regression model. Naturally, London sees the highest median weekly_rent with the highest maximum value too. This is to be expected on account of the competing land uses in the nation's capital driving up land prices and impacting rental rates too. This is further compounded by London having nearly $40$ HEIs and therefore increased demand for PBSA.
Furthermore, other chronically undersupplied cities that see higher rents across all residential sectors, such as Bristol and Edinburgh, also achieve higher rents here. Bristol's undersupply, resulting from draconian planning laws, has driven PBSA rents up sharply over the last few years, with its median weekly_rent approaching that of London. The influence of supply-and-demand dynamics is seen further in Glasgow, which benefits from five universities and one of the largest student populations in the UK; this favourable balance of demand against supply drives rents upwards, as we see reflected here.
This plot highlights clearly that there are sub-market dynamics within the UK and this could contribute towards effectively training a Linear Regression model.
Finally, we calculate and compare group medians for the categorical data.
room_types_medians = rental_data.groupby('room_type')['weekly_rent'].median()
for room_type in room_type_order:
    print(f'The median rent for {room_type}s is: {room_types_medians[room_type]:.2f}.')
The median rent for Non En-Suites is: 159.00. The median rent for En-Suites is: 179.00. The median rent for Studios is: 238.00. The median rent for One Beds is: 290.00.
Here we can see notable differences between the different classifications of room_type, which further suggests that the variable could be used to train a Linear Regression model.
operators_medians = rental_data.groupby('operator')['weekly_rent'].median()
operator_order = beds_by_operator.index.tolist()
for operator in operator_order:
    print(f'The median rent achieved by {operator} is: {operators_medians[operator]:.2f}.')
The median rent achieved by UNITE Students is: 204.00. The median rent achieved by iQ Student Accommodation is: 270.50. The median rent achieved by CRM Students is: 217.50. The median rent achieved by Student Roost is: 224.00. The median rent achieved by HFS - Homes For Students is: 185.00. The median rent achieved by Fresh Student Living is: 219.00. The median rent achieved by HFS - Prestige Student Living is: 239.00. The median rent achieved by Vita Student is: 309.75. The median rent achieved by Collegiate AC is: 211.00. The median rent achieved by Yugo is: 200.00. The median rent achieved by Hello Student is: 208.50. The median rent achieved by Student Castle is: 193.00. The median rent achieved by Host Students is: 197.00. The median rent achieved by Chapter London is: 406.50. The median rent achieved by Canvas Student is: 245.25. The median rent achieved by Downing Students is: 250.00. The median rent achieved by True Student is: 239.75. The median rent achieved by Scape Student Living is: 344.50. The median rent achieved by Abodus Student Living is: 260.00. The median rent achieved by Novel Student is: 272.00. The median rent achieved by Campus Living Villages is: 153.00. The median rent achieved by Mezzino is: 163.25. The median rent achieved by Urbanest is: 389.00. The median rent achieved by Aparto Student is: 274.00. The median rent achieved by Here Students is: 255.00. The median rent achieved by Cloud Student Homes is: 138.50. The median rent achieved by Code Student Accommodation is: 180.00. The median rent achieved by Future Generation Asset Management Limited is: 208.50. The median rent achieved by Dwell Student Living is: 226.50. The median rent achieved by Now Students is: 226.75. The median rent achieved by Sanctuary Student is: 151.00. The median rent achieved by HFS - Urban Student Life is: 155.25. The median rent achieved by Every Student is: 126.50. The median rent achieved by X1 Lettings is: 170.00. 
The median rent achieved by Xenia Students is: 195.00. The median rent achieved by Mansion Student is: 179.00. The median rent achieved by Axo Student Living is: 160.00. The median rent achieved by Study Inn is: 232.50. The median rent achieved by LIVStudent is: 184.00. The median rent achieved by HFS - Universal Student Living is: 185.00. The median rent achieved by Urban Sleep is: 199.00. The median rent achieved by Prime Student Living is: 238.00. The median rent achieved by City Estates is: 179.00. The median rent achieved by Student Facility Management is: 197.50. The median rent achieved by Luna Students is: 227.00. The median rent achieved by Almero Student Mansions is: 213.00. The median rent achieved by DIGS Student is: 121.50. The median rent achieved by Unipol is: 172.50. The median rent achieved by IconInc is: 274.50. The median rent achieved by Propeller Lettings is: 148.50. The median rent achieved by Kexgill Student Accommodation is: 154.50. The median rent achieved by YPP is: 181.73. The median rent achieved by Derwent Students is: 250.00. The median rent achieved by The Social Hub is: 245.00. The median rent achieved by Primo Property Management is: 136.25. The median rent achieved by Student Cribs is: 201.75. The median rent achieved by Nurtur Student Living is: 185.50. The median rent achieved by Ashcourt is: 174.25. The median rent achieved by City Block is: 257.00. The median rent achieved by UniLife is: 330.00. The median rent achieved by Allied Student Accommodation is: 222.50. The median rent achieved by Bee Hive - Harington Investments is: 203.75. The median rent achieved by Days Letting is: 222.50. The median rent achieved by Megaclose Ltd is: 204.50. The median rent achieved by N Joy Student Living is: 145.00. The median rent achieved by HFS - Evo Student is: 230.00. The median rent achieved by u-student is: 132.00. The median rent achieved by Beyond The Box Student Ltd is: 149.25. 
The median rent achieved by Aspire Student Lettings is: 187.50. The median rent achieved by Mapleisle Ltd is: 153.75. The median rent achieved by Warehouse Students Ltd is: 115.20. The median rent achieved by SPACE Student Accomodation is: 367.50. The median rent achieved by Project Student is: 160.00. The median rent achieved by Bailrigg Student Living is: 168.50. The median rent achieved by Pennycuick Collins is: 166.50. The median rent achieved by Premier Student Halls is: 233.50. The median rent achieved by Vanilla Lettings is: 114.00. The median rent achieved by Stanton Asset Management is: 160.50. The median rent achieved by Volume Property is: 265.00. The median rent achieved by CPS Homes is: 161.00. The median rent achieved by Living Worcester Group is: 148.08. The median rent achieved by Gather Students is: 201.00. The median rent achieved by Key Let is: 178.00. The median rent achieved by Lulworth Student Company is: 198.75. The median rent achieved by Stoke Student Living is: 136.25. The median rent achieved by HFS - Essential Student Living is: 230.00. The median rent achieved by Fenton Property Holdings is: 165.00. The median rent achieved by Caro Student Living is: 147.50. The median rent achieved by Graysons Properties is: 120.00. The median rent achieved by ASN Capital is: 245.00. The median rent achieved by Metro Student Accommodation is: 127.50. The median rent achieved by Find Digs is: 265.00. The median rent achieved by Student Letting Company is: 215.00. The median rent achieved by Bagri Foundation is: 205.00. The median rent achieved by Carvels Lettings is: 215.00. The median rent achieved by UniHouse is: 290.00. The median rent achieved by APS Property Group Ltd is: 137.00. The median rent achieved by Unest is: 87.50. The median rent achieved by Purple Frog Property Ltd is: 197.50. The median rent achieved by Stay Clever is: 142.50. The median rent achieved by Heathfield Norwich Limited is: 135.00. 
The median rent achieved by Yellow Door Lets is: 160.00. The median rent achieved by Apex Student Living is: 127.50. The median rent achieved by Manor Villages is: 144.00. The median rent achieved by East Of Exe is: 317.50. The median rent achieved by Oak Student Lets is: 197.00.
Again, we see plenty of variation between the median achieved rents. In some cases, this variation is rather stark, again supporting that this factor may be influential.
city_medians = rental_data.groupby('city')['weekly_rent'].median()
city_order = beds_by_city.index.tolist()
for city in city_order:
    print(f'The median rent achieved in {city} is: {city_medians[city]:.2f}.')
The median rent achieved in London is: 389.00. The median rent achieved in Sheffield is: 174.17. The median rent achieved in Liverpool is: 170.00. The median rent achieved in Nottingham is: 201.00. The median rent achieved in Coventry is: 168.75. The median rent achieved in Leeds is: 229.00. The median rent achieved in Birmingham is: 227.00. The median rent achieved in Manchester is: 260.50. The median rent achieved in Newcastle is: 188.50. The median rent achieved in Glasgow is: 260.00. The median rent achieved in Leicester is: 165.00. The median rent achieved in Cardiff is: 190.00. The median rent achieved in Edinburgh is: 281.75. The median rent achieved in Southampton is: 227.00. The median rent achieved in Portsmouth is: 203.00. The median rent achieved in Aberdeen is: 152.00. The median rent achieved in Exeter is: 246.00. The median rent achieved in Swansea is: 194.50. The median rent achieved in Preston is: 121.00. The median rent achieved in Plymouth is: 162.00. The median rent achieved in Bristol is: 367.50. The median rent achieved in York is: 267.00. The median rent achieved in Lincoln is: 159.00. The median rent achieved in Belfast is: 207.00. The median rent achieved in Bradford is: 107.00. The median rent achieved in Huddersfield is: 141.00. The median rent achieved in Canterbury is: 196.50. The median rent achieved in Bournemouth is: 222.50. The median rent achieved in Brighton is: 320.00. The median rent achieved in Lancaster is: 184.50. The median rent achieved in Norwich is: 194.00. The median rent achieved in Cambridge is: 250.00. The median rent achieved in Stoke On Trent is: 154.00. The median rent achieved in Salford is: 224.50. The median rent achieved in Oxford is: 274.50. The median rent achieved in Guildford is: 260.00. The median rent achieved in Chester is: 192.00. The median rent achieved in Bath is: 295.00. The median rent achieved in Durham is: 227.75. The median rent achieved in Loughborough is: 195.00. 
The median rent achieved in Kingston upon Thames is: 305.00. The median rent achieved in Reading is: 267.50. The median rent achieved in Dundee is: 186.00. The median rent achieved in Colchester is: 184.00. The median rent achieved in Wolverhampton is: 103.50. The median rent achieved in Medway is: 166.75. The median rent achieved in Winchester is: 249.00. The median rent achieved in Sunderland is: 128.75. The median rent achieved in Derby is: 162.00. The median rent achieved in Ipswich is: 179.25. The median rent achieved in St Andrews is: 293.50. The median rent achieved in Bangor is: 155.75. The median rent achieved in Hull is: 135.25. The median rent achieved in Stirling is: 220.00. The median rent achieved in Newport (Wales) is: 141.75. The median rent achieved in Falmouth is: 215.50. The median rent achieved in Bolton is: 115.00. The median rent achieved in Stockton & Middlesbrough is: 138.75. The median rent achieved in Paisley is: 130.00. The median rent achieved in Warwick is: 210.00. The median rent achieved in Wrexham is: 117.00. The median rent achieved in Worcester is: 148.08. The median rent achieved in Carlisle is: 93.00. The median rent achieved in Luton is: 127.50.
Again, we see that some cities, such as London, Bristol, and Edinburgh, achieve high median weekly_rents. This stands in comparison to other cities, such as Carlisle, Bolton, and Bradford, which achieve much lower rates. This supports the assumption that regional factors can influence rental rates and be significant predictors in a Linear Regression model.
In conclusion, we have established that a Linear Regression model would be a reasonable approach. Our analysis revealed that there is a weak positive correlation between build_date and weekly_rent, suggesting that newer buildings charge a premium. We also found there to be no significant correlation between the size of an asset and the weekly_rent, suggesting that this feature will be broadly insignificant for training a model. However, analysis of the categorical variables of room_type, operator, and city suggested that there are clear differences between these categories, which in turn supports that they will be influential in training a linear model.
We now consider the transformations and encoding needed to prepare this data for modelling.
In this section, we shall consider each of the variables and ensure it is ready for the model to be trained upon it. This will involve considering any transformations or encoding that needs to take place and the rationale behind said changes.
Before we transform or encode our data, we are going to split it into a training set and a test set. Splitting the data lets us evaluate the trained model on completely unseen data. We perform the split now to reduce data leakage and to ensure that any transformations we make are derived independently of the test set, which should be treated as completely unseen data.
from sklearn.model_selection import train_test_split
X = rental_data.drop(columns = ['weekly_rent', 'asset', 'bed_class'])
y = rental_data['weekly_rent']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 14)
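To make the leakage point concrete, here is a minimal sketch of the fit-on-train-only discipline, using a hypothetical StandardScaler on toy data rather than the actual transformations we apply later:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature: the scaling statistics are learnt from
# the training split only and then re-used on the test split, so no
# information from the test set leaks into the preprocessing.
X_tr = np.array([[1.0], [2.0], [3.0], [4.0]])
X_te = np.array([[2.5], [10.0]])

scaler = StandardScaler().fit(X_tr)   # fit on the training data only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # apply the training statistics
```

The same principle governs any statistic we derive below, such as medians used for encoding: it must come from the training split alone.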
We start by considering the numerical data variables - build_date and beds on the independent variable side and weekly_rent on the dependent variable side. We start with weekly_rent.
As we saw in the distribution of weekly_rent in Section 3, the weekly_rent data is quite positively skewed. We check this for the training data below.
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.histplot(y_train, kde = True, color = (0, 0.13, 0.27))
ax.set_title("Distribution of the Training Set Weekly Rent", fontweight = "semibold")
ax.set_xlabel("Weekly Rent", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Clearly, we have maintained that positive skew following the splitting of the data. We therefore elect to apply a log transformation, as this will compress the data, reducing the effect of the extreme outliers that give the data its positively skewed shape. Since none of the rents are $0$, we apply the simple transformation $f(x) = \log(x)$.
transformed_y_train = y_train.apply(lambda x: np.log(x))
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.histplot(transformed_y_train, kde = True, color = (0, 0.13, 0.27))
ax.set_title("Log-Distribution of the Weekly Rent", fontweight = "semibold")
ax.set_xlabel("Log-Weekly Rent", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Clearly, the data is now closer to normal and better suited to Linear Regression.
Similarly, we saw that the beds feature was positively skewed. We consider the distribution below for a unique set of assets only, to ensure we are not double-counting anything.
unique_assets = X_train.drop_duplicates(subset = 'asset_id')
unique_train_beds = unique_assets['beds']
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.histplot(unique_train_beds, kde = True, color = (0, 0.13, 0.27), bins = 30)
ax.set_title("Distribution of the Training Set Beds", fontweight = "semibold")
ax.set_xlabel("Beds", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Again, we see that beds is positively skewed. We note that beds is actually a discrete variable and not continuous. Therefore, traditionally we would encode it using one of One-Hot Encoding ("OHE"), Target Encoding, or Ordinal Encoding, which we explain in more detail below.
However, given the size of the data set, the range of the variable, and the evidenced skewness, we are going to treat it as a continuous variable and apply a log transformation.
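To illustrate why a log transformation reduces positive skew, consider a synthetic, log-normally distributed sample (hypothetical data, and assuming scipy is available for the skewness calculation):

```python
import numpy as np
from scipy.stats import skew  # assumes scipy is installed

# Synthetic, right-skewed "bed counts" purely for illustration
rng = np.random.default_rng(14)
synthetic_beds = rng.lognormal(mean = 5.5, sigma = 0.6, size = 1000)

print(f'Skewness before the log transformation: {skew(synthetic_beds):.2f}')
print(f'Skewness after the log transformation: {skew(np.log(synthetic_beds)):.2f}')
```

The sample skewness drops from strongly positive to roughly zero, which is the behaviour we are relying on below.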
transformed_X_train = X_train.copy()
transformed_X_train['beds'] = transformed_X_train.beds.apply(lambda x: np.log(x))
unique_assets = transformed_X_train.drop_duplicates(subset = 'asset_id')
unique_log_train_beds = unique_assets['beds']
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.histplot(unique_log_train_beds, kde = True, color = (0, 0.13, 0.27))
ax.set_title("Log-Distribution of the Beds", fontweight = "semibold")
ax.set_xlabel("Log-Beds", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Again, this is slightly better; however, it is still possible that this variable will be dropped, as our work in Section 4 suggested it is unlikely to influence the model.
Finally, we consider build_date, which is currently a discrete year value. We elect to convert it into a continuous variable, 'age', so that the model can more easily interpret the gaps between years.
from datetime import datetime
current_year = datetime.now().year
transformed_X_train['age'] = current_year - transformed_X_train['build_date']
transformed_X_train.drop(columns = ['build_date'], inplace = True)
We now consider the distribution of the age variable.
unique_assets = transformed_X_train.drop_duplicates(subset = 'asset_id')
unique_train_ages = unique_assets['age']
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.histplot(unique_train_ages, kde = True, color = (0, 0.13, 0.27), bins = 32)
ax.set_title("Distribution of the Training Set Asset Ages", fontweight = "semibold")
ax.set_xlabel("Age", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Here we can see the data shows a slight positive skew. Furthermore, as we have now reframed this data as continuous, we can treat it as such and apply a log transformation as above to reduce the skewness.
We note that there are some assets with age $0$ and thus use the transformation $f(x) = \log(x+1)$ as $\log(0)$ is not defined.
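As an aside, NumPy provides np.log1p, which computes $\log(1+x)$ directly and is more numerically accurate for values near zero; a quick equivalence check on some illustrative ages:

```python
import numpy as np

ages = np.array([0.0, 1.0, 5.0, 24.0])  # illustrative ages only

# log1p(x) agrees with log(x + 1) wherever both are defined
print(np.allclose(np.log1p(ages), np.log(ages + 1)))  # prints True
```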
transformed_X_train = transformed_X_train.copy()
log_age = transformed_X_train.age.apply(lambda x: np.log(x + 1))
unique_assets = transformed_X_train.drop_duplicates(subset = 'asset_id')
unique_log_train_ages = log_age.loc[unique_assets.index]
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.histplot(unique_log_train_ages, kde = True, color = (0, 0.13, 0.27))
ax.set_title("Log-Distribution of the Ages", fontweight = "semibold")
ax.set_xlabel("Log-Age", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
We notice that the distribution is quite sparse now and actually appears to be slightly negatively skewed. This suggests the original distribution was not that positively skewed and thus the log transformation is too severe. We consider a slightly less aggressive square root transformation where $f(x) = \sqrt{x}$ instead.
transformed_X_train = transformed_X_train.copy()
sqrt_age = transformed_X_train.age.apply(lambda x: np.sqrt(x))
unique_assets = transformed_X_train.drop_duplicates(subset = 'asset_id')
unique_sqrt_train_ages = sqrt_age.loc[unique_assets.index]
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.histplot(unique_sqrt_train_ages, kde = True, color = (0, 0.13, 0.27), bins = 15)
ax.set_title("Square Root Distribution of the Ages", fontweight = "semibold")
ax.set_xlabel("Square Root Age", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
We note that whilst this isn't perfect, it has normalised the data somewhat. We also note that our work in Section 4 already highlighted a weak linear relationship between build_date and weekly_rent, which the Linear Regression model should be able to capture.
transformed_X_train = transformed_X_train.copy()
transformed_X_train['age'] = transformed_X_train.age.apply(lambda x: np.sqrt(x))
Now that we have transformed the numerical features, we turn our attention to the categorical features, starting with room_type. Given it is a categorical variable, we need to encode the data. One type of encoding is OHE, where we create $n$ columns (or $n-1$) for the $n$ categories in our variable, with a $1$ in a column if the row belongs to that category and a $0$ otherwise. This captures all the granularity of the data, although it can increase dimensionality by adding extra columns. Another option is Ordinal Encoding, where we assume an order and assign a number $1$ through $n$ to each category. A third option is Target Encoding, where we replace the categorical label with a statistic of the target for that category, such as its mean or median. This retains a level of information about the variable whilst keeping a lower dimensionality than OHE.
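As a quick sketch of these three schemes on a toy frame (hypothetical room categories and rents, not our data set):

```python
import pandas as pd

toy = pd.DataFrame({'room': ['A', 'B', 'A', 'C'],
                    'rent': [100, 150, 110, 200]})

# One-Hot Encoding: one indicator column per category
ohe = pd.get_dummies(toy['room'], prefix = 'room')

# Ordinal Encoding: map an assumed ordering onto integers
ordinal = toy['room'].map({'A': 1, 'B': 2, 'C': 3})

# Target Encoding: replace each category with its median rent
target = toy['room'].map(toy.groupby('room')['rent'].median())
```

Here ohe has three indicator columns, ordinal collapses to a single integer column, and target maps both 'A' rows to their shared median of $105$.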
Whilst the box plots created earlier show that there is indeed an ordinal pattern to the variable room_type, with non en-suites having the lowest median weekly_rent and one-beds having the highest, we note that there was not an extremely clear distinction between non en-suites and en-suites. Furthermore, given that there are only four categories for this variable, we will not be increasing the dimensionality too much if we use OHE. For those reasons, we will use OHE for the room_type variable.
# drop_first = True removes the first column, as the absence of a one in the
# remaining columns implies membership of the dropped category by elimination.
# This helps reduce dimensionality somewhat.
transformed_X_train = pd.get_dummies(transformed_X_train, columns = ["room_type"], drop_first = True)
We now consider the city variable. We note that there are $63$ different cities in the training set and that many cities do not have many data points, as seen from the bar chart in Section 3.
city_freq = X_train.city.value_counts()
city_freq_list = city_freq.index.tolist()
for city in city_freq_list:
    print(f'There is/are {city_freq[city]} sample(s) for {city}.')
There is/are 134 sample(s) for London. There is/are 94 sample(s) for Sheffield. There is/are 90 sample(s) for Nottingham. There is/are 65 sample(s) for Liverpool. There is/are 62 sample(s) for Leeds. There is/are 56 sample(s) for Birmingham. There is/are 52 sample(s) for Newcastle. There is/are 51 sample(s) for Edinburgh. There is/are 50 sample(s) for Coventry. There is/are 48 sample(s) for Glasgow. There is/are 44 sample(s) for Exeter. There is/are 39 sample(s) for Leicester. There is/are 37 sample(s) for Manchester. There is/are 36 sample(s) for Aberdeen. There is/are 34 sample(s) for Cardiff. There is/are 32 sample(s) for Southampton. There is/are 29 sample(s) for Plymouth. There is/are 26 sample(s) for Bristol. There is/are 21 sample(s) for Lancaster. There is/are 21 sample(s) for Canterbury. There is/are 20 sample(s) for Loughborough. There is/are 20 sample(s) for Brighton. There is/are 19 sample(s) for Preston. There is/are 18 sample(s) for Portsmouth. There is/are 15 sample(s) for Bournemouth. There is/are 15 sample(s) for York. There is/are 14 sample(s) for Stoke On Trent. There is/are 12 sample(s) for Norwich. There is/are 12 sample(s) for Swansea. There is/are 12 sample(s) for Lincoln. There is/are 11 sample(s) for Oxford. There is/are 11 sample(s) for Cambridge. There is/are 10 sample(s) for Kingston upon Thames. There is/are 10 sample(s) for Belfast. There is/are 10 sample(s) for Chester. There is/are 10 sample(s) for Guildford. There is/are 9 sample(s) for Bath. There is/are 9 sample(s) for Salford. There is/are 9 sample(s) for Huddersfield. There is/are 8 sample(s) for Colchester. There is/are 8 sample(s) for Reading. There is/are 8 sample(s) for Dundee. There is/are 7 sample(s) for Bradford. There is/are 7 sample(s) for Sunderland. There is/are 6 sample(s) for Winchester. There is/are 6 sample(s) for Stirling. There is/are 5 sample(s) for Bolton. There is/are 5 sample(s) for Derby. There is/are 4 sample(s) for Wolverhampton. 
There is/are 4 sample(s) for Bangor. There is/are 4 sample(s) for St Andrews. There is/are 4 sample(s) for Hull. There is/are 3 sample(s) for Durham. There is/are 3 sample(s) for Falmouth. There is/are 2 sample(s) for Carlisle. There is/are 2 sample(s) for Warwick. There is/are 2 sample(s) for Medway. There is/are 2 sample(s) for Worcester. There is/are 1 sample(s) for Paisley. There is/are 1 sample(s) for Wrexham. There is/are 1 sample(s) for Stockton & Middlesbrough. There is/are 1 sample(s) for Newport (Wales). There is/are 1 sample(s) for Luton. There is/are 1 sample(s) for Ipswich.
Again, we have similar encoding options available to us as for room_type. There are too many unique values here to use OHE directly, and whilst we could piece together some ordinal data, it would be quicker to use Target Encoding. So, we have Target Encoding as a viable option. Alternatively, we could reduce the dimensionality and still use OHE by assigning each city to one of $12$ regions in the UK, namely: Scotland, Northern Ireland, Wales, North West, North East, Yorkshire, West Midlands, East Midlands, East, South West, London, or South East. That would leave us with $12$ categories, which, after dropping the first column, results in an increased dimensionality of $11$ columns, which is broadly acceptable.
Both methods have their merits. Target Encoding enables us to maintain our city-level granularity and ensures a lower dimensionality in comparison to OHE. However, given that $13$ cities have five or fewer samples, there is a chance that the model overfits and does not generalise well. On the other hand, the regional grouping smooths over city-level differences and relies upon the assumption that cities in the same region of the UK achieve similar rents, which we can see from the box plot in Section 4 is not always the case. For example, Edinburgh and Aberdeen would both be grouped into the Scotland category, but clearly have different rental distributions.
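For concreteness, the region-grouping alternative could be sketched as follows. The partial city_to_region mapping here is hypothetical and for illustration only; the real mapping would need to assign every city in the dataset to one of the $12$ regions.

```python
import pandas as pd

# Hypothetical, partial city-to-region lookup for illustration only;
# the real mapping would cover every city in the dataset.
city_to_region = {
    'Edinburgh': 'Scotland',
    'Aberdeen': 'Scotland',
    'Cardiff': 'Wales',
    'Leeds': 'Yorkshire',
    'London': 'London',
}

cities = pd.DataFrame({'city': ['Edinburgh', 'Aberdeen', 'Cardiff', 'Leeds', 'London']})
cities['region'] = cities['city'].map(city_to_region)

# OHE the region, dropping the first category to reduce dimensionality
region_dummies = pd.get_dummies(cities['region'], prefix='region', drop_first=True)
print(region_dummies.columns.tolist())
# ['region_Scotland', 'region_Wales', 'region_Yorkshire']
```

Note that Edinburgh and Aberdeen receive identical encodings under this scheme, which is precisely the loss of granularity discussed above.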
Given this difference within regions, we will use Target Encoding. However, to mitigate overfitting, we will smooth the city medians towards the UK median from the training set. We start by calculating both the city median and the UK median, before taking a weighted combination of the two as the value to encode. We note that weekly_rent has already been log transformed, so these calculations are performed on the log scale.
UK_median_rent = transformed_y_train.median()
print(f'The UK median weekly_rent is {UK_median_rent:.2f}.')
The UK median weekly_rent is 5.35.
transformed_training_data = transformed_X_train.copy()
transformed_training_data['weekly_rent'] = transformed_y_train
training_city_median_rents = transformed_training_data.groupby('city')['weekly_rent'].median().rename('median_rent')
training_city_median_rents.sort_values(ascending = False, inplace = True)
city_order = training_city_median_rents.index.tolist()
for city in city_order:
    print(f'The median rent achieved in {city} is: {training_city_median_rents[city]:.2f}.')
The median rent achieved in London is: 5.96. The median rent achieved in Bristol is: 5.88. The median rent achieved in St Andrews is: 5.82. The median rent achieved in Brighton is: 5.78. The median rent achieved in Kingston upon Thames is: 5.73. The median rent achieved in Reading is: 5.64. The median rent achieved in Oxford is: 5.62. The median rent achieved in Edinburgh is: 5.61. The median rent achieved in York is: 5.59. The median rent achieved in Manchester is: 5.57. The median rent achieved in Falmouth is: 5.56. The median rent achieved in Guildford is: 5.55. The median rent achieved in Glasgow is: 5.55. The median rent achieved in Durham is: 5.54. The median rent achieved in Winchester is: 5.52. The median rent achieved in Bath is: 5.50. The median rent achieved in Cambridge is: 5.48. The median rent achieved in Exeter is: 5.48. The median rent achieved in Leeds is: 5.46. The median rent achieved in Birmingham is: 5.43. The median rent achieved in Southampton is: 5.42. The median rent achieved in Salford is: 5.41. The median rent achieved in Bournemouth is: 5.40. The median rent achieved in Stirling is: 5.40. The median rent achieved in Dundee is: 5.40. The median rent achieved in Portsmouth is: 5.35. The median rent achieved in Warwick is: 5.33. The median rent achieved in Belfast is: 5.33. The median rent achieved in Norwich is: 5.33. The median rent achieved in Nottingham is: 5.30. The median rent achieved in Canterbury is: 5.28. The median rent achieved in Cardiff is: 5.27. The median rent achieved in Swansea is: 5.27. The median rent achieved in Newcastle is: 5.23. The median rent achieved in Lancaster is: 5.22. The median rent achieved in Colchester is: 5.21. The median rent achieved in Loughborough is: 5.20. The median rent achieved in Sheffield is: 5.18. The median rent achieved in Coventry is: 5.15. The median rent achieved in Liverpool is: 5.15. The median rent achieved in Chester is: 5.14. The median rent achieved in Medway is: 5.11. 
The median rent achieved in Leicester is: 5.11. The median rent achieved in Plymouth is: 5.09. The median rent achieved in Derby is: 5.09. The median rent achieved in Lincoln is: 5.07. The median rent achieved in Bangor is: 5.06. The median rent achieved in Worcester is: 5.05. The median rent achieved in Stoke On Trent is: 5.04. The median rent achieved in Aberdeen is: 5.03. The median rent achieved in Ipswich is: 5.03. The median rent achieved in Huddersfield is: 4.98. The median rent achieved in Stockton & Middlesbrough is: 4.98. The median rent achieved in Newport (Wales) is: 4.95. The median rent achieved in Luton is: 4.94. The median rent achieved in Hull is: 4.91. The median rent achieved in Paisley is: 4.87. The median rent achieved in Sunderland is: 4.85. The median rent achieved in Preston is: 4.80. The median rent achieved in Wrexham is: 4.76. The median rent achieved in Bolton is: 4.74. The median rent achieved in Bradford is: 4.67. The median rent achieved in Wolverhampton is: 4.64. The median rent achieved in Carlisle is: 4.47.
We now wish to smooth these median rents to account for small sample sizes. We shall create a function that blends the UK median and the city median, with the weighting influenced by the size of the sample we have for an individual city. We shall then append this information to our training data.
def smooth_medians(global_median, local_median, local_sample_size, global_weighting):
    return (local_sample_size*local_median + global_weighting*global_median) / (local_sample_size + global_weighting)
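To see the shrinkage behaviour concretely, here is a small worked example using the UK median of $5.35$ from above and a hypothetical city median of $4.50$ (the function definition is repeated so the snippet runs standalone):

```python
# Repeated from above so the snippet is self-contained.
def smooth_medians(global_median, local_median, local_sample_size, global_weighting):
    return (local_sample_size*local_median + global_weighting*global_median) / (local_sample_size + global_weighting)

# With a weighting of 5, a single-sample city is pulled strongly towards the
# UK median, while a 100-sample city keeps almost exactly its own median.
print(round(smooth_medians(5.35, 4.50, 1, 5), 2))    # 5.21
print(round(smooth_medians(5.35, 4.50, 100, 5), 2))  # 4.54
```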
training_city_median_rents = training_city_median_rents.to_frame().reset_index()
training_city_median_rents['city_sample_size'] = training_city_median_rents.city.map(city_freq)
UK_rent_weighting = 5
training_city_median_rents['smoothed_median_rent'] = training_city_median_rents.apply(
    lambda row: smooth_medians(UK_median_rent, row['median_rent'], row['city_sample_size'], UK_rent_weighting),
    axis = 1
)
city_to_rent_dict = training_city_median_rents.set_index('city')['smoothed_median_rent'].to_dict()
transformed_X_train['city'] = transformed_X_train.city.map(city_to_rent_dict)
We now have an encoded city variable for the training data.
Our final categorical variable to encode is the operator variable. Similarly to city, we cannot simply OHE this variable, as there are over $100$ different operators and this would add too much dimensionality. As with city, this leaves two options: we can either group the operators into categories and OHE this lower-dimensionality variable, or we can use Target Encoding in a similar manner to the above.
Irrespective of the method we choose, we need a way of classifying the operators numerically. One way we could achieve this is by considering the median weekly_rent achieved by each operator. This would work; however, some operators, such as Chapter London and Urbanest, run London-only PBSA brands. This would see them achieve much higher rents than a regional brand of equal operational prestige, solely on account of the city. Therefore, we need to account for this. Similarly, we need to account for the room_type, as an operator with solely one-beds will achieve a higher rent than an operator with just non en-suites, even if they are operating in the same city.
Therefore, to isolate the effect an operator is having on the weekly_rent being achieved, we need to calculate some sort of premium on a city and room_type level for each operator. We can do so by comparing the median weekly_rent achieved by an operator in a certain city for a certain room_type to the median weekly_rent achieved by all operators across that city and room_type. This gives us a multiplicative factor that reflects whether the operator is adding to the weekly_rent (premium $\gt 1$) or having a negative effect on the weekly_rent (premium $\lt 1$).
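A toy illustration of this premium calculation, using hypothetical operators and untransformed rents for readability (in the project itself the rents are already log transformed, and the cell medians are merged back rather than computed via transform):

```python
import pandas as pd

# Two hypothetical operators letting studios in the same city.
df = pd.DataFrame({
    'city_name': ['Leeds'] * 4,
    'room_type': ['Studio'] * 4,
    'operator': ['A', 'A', 'B', 'B'],
    'weekly_rent': [220.0, 230.0, 180.0, 190.0],
})

# Median rent for each (city, room_type) cell across all operators...
df['city_room_median'] = df.groupby(['city_name', 'room_type'])['weekly_rent'].transform('median')
# ...and each row's rent relative to that cell median
df['asset_premium'] = df['weekly_rent'] / df['city_room_median']

# Operator A lets above the cell median (premium > 1), operator B below it
print(df.groupby('operator')['asset_premium'].mean().round(2))
```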
Below, we calculate the premiums for each operator on a city and room_type level. We start by adding back in the city_name column and the room_type column, as well as the weekly_rent data, in order to perform these calculations.
premium_df = transformed_X_train.join(X_train[['city', 'room_type']], how = 'left', rsuffix = '_name')
premium_df['weekly_rent'] = transformed_y_train
city_room_median = premium_df.groupby(['city_name', 'room_type'])['weekly_rent'].median().rename('city_room_median')
premium_df = premium_df.merge(city_room_median, on = ['city_name', 'room_type'])
premium_df['asset_premium'] = premium_df['weekly_rent'] / premium_df['city_room_median']
We have the following average premiums by operator.
operator_premium = premium_df.groupby('operator')['asset_premium'].mean().reset_index()
operator_premium.sort_values(by = 'asset_premium', ascending = False, inplace = True)
for operator, premium in zip(operator_premium['operator'], operator_premium['asset_premium']):
    print(f'{operator}: {premium:.4f}')
Propeller Lettings: 1.0534 City Block: 1.0532 Vita Student: 1.0423 Novel Student: 1.0410 UniHouse: 1.0397 Aspire Student Lettings: 1.0394 IconInc: 1.0383 UniLife: 1.0382 Study Inn: 1.0352 Volume Property: 1.0298 East Of Exe: 1.0289 Urban Sleep: 1.0289 True Student: 1.0237 Urbanest: 1.0198 SPACE Student Accomodation: 1.0197 HFS - Evo Student: 1.0192 Now Students: 1.0182 iQ Student Accommodation: 1.0176 Prime Student Living: 1.0176 ASN Capital: 1.0150 Future Generation Asset Management Limited: 1.0142 Luna Students: 1.0136 Scape Student Living: 1.0133 Downing Students: 1.0126 Stoke Student Living: 1.0110 Carvels Lettings: 1.0086 Student Roost: 1.0077 HFS - Prestige Student Living: 1.0071 Aparto Student: 1.0070 The Social Hub: 1.0061 Collegiate AC: 1.0057 Nurtur Student Living: 1.0051 Bee Hive - Harington Investments: 1.0051 Kexgill Student Accommodation: 1.0044 Chapter London: 1.0043 Abodus Student Living: 1.0042 Bailrigg Student Living: 1.0035 Ashcourt: 1.0032 LIVStudent: 1.0022 Student Facility Management: 1.0008 Student Cribs: 1.0003 u-student: 1.0002 Unest: 1.0000 Living Worcester Group: 1.0000 APS Property Group Ltd: 1.0000 Bagri Foundation: 1.0000 Apex Student Living: 1.0000 Canvas Student: 0.9999 Mapleisle Ltd: 0.9997 Mezzino: 0.9990 Fresh Student Living: 0.9986 Metro Student Accommodation: 0.9985 Almero Student Mansions: 0.9984 HFS - Universal Student Living: 0.9976 Hello Student: 0.9969 Yugo: 0.9966 Here Students: 0.9966 UNITE Students: 0.9940 CRM Students: 0.9938 Manor Villages: 0.9932 Every Student: 0.9930 Premier Student Halls: 0.9929 DIGS Student: 0.9928 Host Students: 0.9928 Megaclose Ltd: 0.9924 Purple Frog Property Ltd: 0.9916 Fenton Property Holdings: 0.9914 Student Castle: 0.9910 Cloud Student Homes: 0.9902 Vanilla Lettings: 0.9893 Axo Student Living: 0.9868 Graysons Properties: 0.9862 City Estates: 0.9854 HFS - Homes For Students: 0.9854 Student Letting Company: 0.9849 Xenia Students: 0.9848 Allied Student Accommodation: 0.9840 Mansion Student: 
0.9839 Sanctuary Student: 0.9831 X1 Lettings : 0.9825 Code Student Accommodation: 0.9815 HFS - Urban Student Life: 0.9800 Warehouse Students Ltd: 0.9794 YPP: 0.9784 Stanton Asset Management: 0.9781 Oak Student Lets: 0.9775 Yellow Door Lets: 0.9766 Stay Clever: 0.9765 Unipol: 0.9756 Lulworth Student Company: 0.9751 Dwell Student Living: 0.9746 Caro Student Living: 0.9726 Derwent Students: 0.9725 Gather Students: 0.9700 Heathfield Norwich Limited: 0.9598 Beyond The Box Student Ltd: 0.9568 Project Student: 0.9555 Primo Property Management: 0.9548 Campus Living Villages: 0.9515 N Joy Student Living: 0.9453 Key Let: 0.9352 Find Digs : 0.9181
We now plot these premiums to get an idea for the distribution of this feature.
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.histplot(operator_premium['asset_premium'], kde = True, color = (0, 0.13, 0.27), bins = 18)
ax.set_title("Asset Premiums", fontweight = "semibold")
ax.set_xlabel("Asset Premium", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
We note that, given the premiums were calculated using the already log-transformed data, the distribution is broadly normal, albeit with a fatter right tail than we would expect.
Before we map this data onto the operator value to complete the Target Encoding, we wish to smooth out the encoded values using a similar approach to how we encoded the city variable.
In order to do this, we require a UK-wide median premium across all operators. We will then weight each operator's premium against it, depending upon the sample size for each operator.
operator_freq = transformed_X_train.operator.value_counts()
operator_freq_list = operator_freq.index.tolist()
for operator in operator_freq_list:
    print(f'There is/are {operator_freq[operator]} sample(s) for {operator}.')
There is/are 112 sample(s) for UNITE Students. There is/are 99 sample(s) for Hello Student. There is/are 99 sample(s) for iQ Student Accommodation. There is/are 97 sample(s) for CRM Students. There is/are 69 sample(s) for HFS - Homes For Students. There is/are 68 sample(s) for Student Roost. There is/are 68 sample(s) for HFS - Prestige Student Living. There is/are 60 sample(s) for Fresh Student Living. There is/are 40 sample(s) for Collegiate AC. There is/are 34 sample(s) for Host Students. There is/are 34 sample(s) for Yugo. There is/are 28 sample(s) for Student Castle. There is/are 26 sample(s) for Vita Student. There is/are 21 sample(s) for Chapter London. There is/are 20 sample(s) for Novel Student. There is/are 19 sample(s) for Every Student. There is/are 19 sample(s) for Downing Students. There is/are 19 sample(s) for HFS - Urban Student Life. There is/are 18 sample(s) for Cloud Student Homes. There is/are 18 sample(s) for Canvas Student. There is/are 17 sample(s) for Mezzino. There is/are 17 sample(s) for Xenia Students. There is/are 16 sample(s) for Abodus Student Living. There is/are 14 sample(s) for Aparto Student. There is/are 14 sample(s) for True Student. There is/are 13 sample(s) for Scape Student Living. There is/are 13 sample(s) for Future Generation Asset Management Limited. There is/are 12 sample(s) for Bee Hive - Harington Investments. There is/are 12 sample(s) for HFS - Universal Student Living. There is/are 11 sample(s) for City Estates. There is/are 11 sample(s) for Study Inn. There is/are 10 sample(s) for Dwell Student Living. There is/are 10 sample(s) for Urbanest. There is/are 10 sample(s) for Mansion Student. There is/are 9 sample(s) for Now Students. There is/are 9 sample(s) for X1 Lettings . There is/are 9 sample(s) for Campus Living Villages. There is/are 8 sample(s) for Primo Property Management. There is/are 8 sample(s) for Here Students. There is/are 8 sample(s) for Kexgill Student Accommodation. 
There is/are 7 sample(s) for Prime Student Living. There is/are 7 sample(s) for Sanctuary Student. There is/are 6 sample(s) for IconInc. There is/are 6 sample(s) for Urban Sleep. There is/are 6 sample(s) for Allied Student Accommodation. There is/are 6 sample(s) for Almero Student Mansions. There is/are 6 sample(s) for City Block. There is/are 5 sample(s) for Student Facility Management. There is/are 5 sample(s) for UniLife. There is/are 5 sample(s) for Project Student. There is/are 5 sample(s) for YPP. There is/are 5 sample(s) for Student Cribs. There is/are 4 sample(s) for Gather Students. There is/are 4 sample(s) for DIGS Student. There is/are 4 sample(s) for Metro Student Accommodation. There is/are 3 sample(s) for Unipol. There is/are 3 sample(s) for Fenton Property Holdings. There is/are 3 sample(s) for Axo Student Living. There is/are 3 sample(s) for Derwent Students. There is/are 3 sample(s) for Premier Student Halls. There is/are 3 sample(s) for Key Let. There is/are 3 sample(s) for LIVStudent. There is/are 3 sample(s) for Megaclose Ltd. There is/are 2 sample(s) for Propeller Lettings. There is/are 2 sample(s) for Heathfield Norwich Limited. There is/are 2 sample(s) for Volume Property. There is/are 2 sample(s) for Carvels Lettings. There is/are 2 sample(s) for N Joy Student Living. There is/are 2 sample(s) for Nurtur Student Living. There is/are 2 sample(s) for Aspire Student Lettings. There is/are 2 sample(s) for Living Worcester Group. There is/are 2 sample(s) for Caro Student Living. There is/are 2 sample(s) for Find Digs . There is/are 2 sample(s) for SPACE Student Accomodation. There is/are 2 sample(s) for Lulworth Student Company. There is/are 2 sample(s) for Unest. There is/are 2 sample(s) for Ashcourt. There is/are 2 sample(s) for Luna Students. There is/are 2 sample(s) for Stanton Asset Management. There is/are 2 sample(s) for ASN Capital. There is/are 2 sample(s) for Yellow Door Lets. There is/are 2 sample(s) for u-student. 
There is/are 2 sample(s) for HFS - Evo Student. There is/are 1 sample(s) for The Social Hub. There is/are 1 sample(s) for Student Letting Company. There is/are 1 sample(s) for Graysons Properties. There is/are 1 sample(s) for Apex Student Living. There is/are 1 sample(s) for Manor Villages. There is/are 1 sample(s) for Stay Clever. There is/are 1 sample(s) for Warehouse Students Ltd. There is/are 1 sample(s) for APS Property Group Ltd. There is/are 1 sample(s) for Beyond The Box Student Ltd. There is/are 1 sample(s) for Code Student Accommodation. There is/are 1 sample(s) for Bailrigg Student Living. There is/are 1 sample(s) for Oak Student Lets. There is/are 1 sample(s) for UniHouse. There is/are 1 sample(s) for Mapleisle Ltd. There is/are 1 sample(s) for Bagri Foundation. There is/are 1 sample(s) for East Of Exe. There is/are 1 sample(s) for Purple Frog Property Ltd. There is/are 1 sample(s) for Stoke Student Living. There is/are 1 sample(s) for Vanilla Lettings.
UK_median_premium = premium_df.asset_premium.median()
print(f'The UK median premium is {UK_median_premium:.2f}.')
The UK median premium is 1.00.
As we would expect, the median premium is $1$, which reflects no premium or discount at all. We now map on the sample sizes for each operator before using the smooth_medians function defined earlier to weight the encoded values and smooth out extremities.
operator_premium['sample_size'] = operator_premium.operator.map(operator_freq)
UK_premium_weighting = 5
operator_premium['smoothed_premium'] = operator_premium.apply(
    lambda row: smooth_medians(UK_median_premium, row['asset_premium'], row['sample_size'], UK_premium_weighting),
    axis = 1
)
operator_premium.sort_values(by = 'smoothed_premium', inplace = True, ascending = False)
for operator, premium in zip(operator_premium['operator'], operator_premium['smoothed_premium']):
    print(f'{operator}: {premium:.5f}.')
Vita Student: 1.03544. Novel Student: 1.03283. City Block: 1.02903. Study Inn: 1.02418. IconInc: 1.02087. UniLife: 1.01911. True Student: 1.01747. iQ Student Accommodation: 1.01676. Urban Sleep: 1.01578. Propeller Lettings: 1.01526. Urbanest: 1.01322. Now Students: 1.01171. Aspire Student Lettings: 1.01125. Prime Student Living: 1.01024. Future Generation Asset Management Limited: 1.01024. Downing Students: 1.00995. Scape Student Living: 1.00963. Volume Property: 1.00852. Student Roost: 1.00714. UniHouse: 1.00662. HFS - Prestige Student Living: 1.00661. SPACE Student Accomodation: 1.00561. HFS - Evo Student: 1.00547. Aparto Student: 1.00512. Collegiate AC: 1.00505. East Of Exe: 1.00482. ASN Capital: 1.00428. Luna Students: 1.00389. Bee Hive - Harington Investments: 1.00359. Chapter London: 1.00346. Abodus Student Living: 1.00318. Kexgill Student Accommodation: 1.00271. Carvels Lettings: 1.00245. Stoke Student Living: 1.00184. Nurtur Student Living: 1.00147. The Social Hub: 1.00101. Ashcourt: 1.00092. LIVStudent: 1.00082. Bailrigg Student Living: 1.00059. Student Facility Management: 1.00041. Student Cribs: 1.00017. u-student: 1.00006. Unest: 1.00000. Living Worcester Group: 1.00000. APS Property Group Ltd: 1.00000. Bagri Foundation: 1.00000. Apex Student Living: 1.00000. Mapleisle Ltd: 0.99994. Canvas Student: 0.99990. Metro Student Accommodation: 0.99935. Mezzino: 0.99922. Almero Student Mansions: 0.99913. Manor Villages: 0.99886. Fresh Student Living: 0.99867. Purple Frog Property Ltd: 0.99859. HFS - Universal Student Living: 0.99829. Vanilla Lettings: 0.99821. Here Students: 0.99790. Graysons Properties: 0.99769. Student Letting Company: 0.99748. Premier Student Halls: 0.99735. Megaclose Ltd: 0.99716. Yugo: 0.99707. Hello Student: 0.99702. Code Student Accommodation: 0.99691. DIGS Student: 0.99681. Fenton Property Holdings: 0.99678. Warehouse Students Ltd: 0.99657. Oak Student Lets: 0.99625. Stay Clever: 0.99608. Axo Student Living: 0.99506. 
Every Student: 0.99448. UNITE Students: 0.99422. CRM Students: 0.99408. Stanton Asset Management: 0.99374. Host Students: 0.99373. Yellow Door Lets: 0.99330. Lulworth Student Company: 0.99289. Beyond The Box Student Ltd: 0.99279. Student Castle: 0.99236. Cloud Student Homes: 0.99232. Caro Student Living: 0.99217. Allied Student Accommodation: 0.99126. Unipol: 0.99086. Sanctuary Student: 0.99016. City Estates: 0.98997. Derwent Students: 0.98970. Mansion Student: 0.98929. YPP: 0.98921. X1 Lettings : 0.98874. Heathfield Norwich Limited: 0.98852. Xenia Students: 0.98826. Gather Students: 0.98666. HFS - Homes For Students: 0.98639. N Joy Student Living: 0.98438. HFS - Urban Student Life: 0.98420. Dwell Student Living: 0.98305. Project Student: 0.97774. Find Digs : 0.97660. Key Let: 0.97570. Primo Property Management: 0.97221. Campus Living Villages: 0.96885.
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.histplot(operator_premium['smoothed_premium'], kde = True, color = (0, 0.13, 0.27), bins = 20)
ax.set_title("Smoothed Asset Premiums", fontweight = "semibold")
ax.set_xlabel("Smoothed Asset Premium", fontweight = "medium")
ax.set_ylabel("Count", fontweight = "medium")
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
As we can see, the smoothed asset premiums depict a slightly more normal distribution, albeit still with fatter tails.
We now map the smoothed_premiums onto the transformed_X_train data set and tidy up the DataFrame.
operator_to_premium_dict = operator_premium.set_index('operator')['smoothed_premium'].to_dict()
transformed_X_train['operator'] = transformed_X_train.operator.map(operator_to_premium_dict)
transformed_X_train.drop(columns = ['asset_id'], inplace = True)
print(transformed_X_train.head())
print(transformed_y_train.head())
city operator beds age room_type_Non En-Suite \
101 5.941404 0.986656 3.465736 2.236068 0
336 5.941404 1.003461 5.723585 2.449490 0
637 5.190118 1.007143 6.879356 2.236068 0
1412 5.190118 0.986392 5.556828 2.236068 0
1456 5.447034 0.994220 6.208590 4.472136 0
room_type_One Bed room_type_Studio
101 0 1
336 0 1
637 1 0
1412 0 1
1456 0 1
101 5.743003
336 6.107023
637 5.624018
1412 5.105945
1456 5.225747
Name: weekly_rent, dtype: float64
Finally, we standardise all of the variables to allow for consistent coefficient comparisons in Section 6.
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler()
y_scaler = StandardScaler()
X_scaler.fit(transformed_X_train)
y_scaler.fit(transformed_y_train.to_numpy().reshape(-1, 1))
transformed_X_train = X_scaler.transform(transformed_X_train)
transformed_y_train = y_scaler.transform(transformed_y_train.to_numpy().reshape(-1, 1))
transformed_X_train = pd.DataFrame(transformed_X_train, columns = [
    'city',
    'operator',
    'beds',
    'age',
    'room_type_Non En-Suite',
    'room_type_One Bed',
    'room_type_Studio'
])
transformed_y_train = pd.Series(transformed_y_train.flatten(), name = 'weekly_rent')
print(transformed_X_train.head())
print(transformed_y_train.head())
       city  operator      beds       age  room_type_Non En-Suite  \
0  2.224795 -1.206764 -2.322244 -0.602575               -0.303558
1  2.224795  0.236646  0.344062 -0.399597               -0.303558
2 -0.782737  0.552917  1.708919 -0.602575               -0.303558
3 -0.782737 -1.229441  0.147138 -0.602575               -0.303558
4  0.245744 -0.557021  0.916808  1.524072               -0.303558
   room_type_One Bed  room_type_Studio
0          -0.272166          1.062868
1          -0.272166          1.062868
2           3.674235         -0.940851
3          -0.272166          1.062868
4          -0.272166          1.062868
0    0.990764
1    1.980077
2    0.667391
3   -0.740598
4   -0.415009
Name: weekly_rent, dtype: float64
As we can see above, we have both transformed_X_train and transformed_y_train ready for the model, with all data having been transformed and encoded, as required. All of the data has then been standardised.
We now proceed to repeat these encoding, transformation, and standardisation steps with the test data before training and assessing a Linear Regression model.
In this section, we begin by transforming our test data in the same manner as we transformed our training data to ensure the data is ready to be used for modelling. We note that these transformations are fitted on the training data only, in order to avoid data leakage.
We then train a Linear Regression model on the training data before applying it to our test data. Finally we evaluate the model and consider some further modelling approaches before iterating these models to achieve a model with the best performance possible.
We start by applying the logarithm transformations to both weekly_rent in y_test and the beds variable in X_test.
transformed_y_test = y_test.apply(lambda x: np.log(x))
transformed_X_test = X_test.copy()
transformed_X_test['beds'] = X_test.beds.apply(lambda x: np.log(x))
Next, we calculate the age variable and transform it with the $f(x) = \sqrt{x}$ transformation, before dropping the build_date variable.
transformed_X_test['age'] = current_year - transformed_X_test['build_date']
transformed_X_test['age'] = transformed_X_test.age.apply(lambda x: np.sqrt(x))
transformed_X_test.drop(columns = ['build_date'], inplace = True)
We now OHE the room_type, dropping the first column to maintain a lower dimensionality.
transformed_X_test = pd.get_dummies(transformed_X_test, columns = ["room_type"], drop_first = True)
We now target encode both city and operator, making sure to use the encoding values from the training set. Any cities or operators not seen in the training set map to NaN, which we fill with the respective global UK medians calculated on the training set.
transformed_X_test['city'] = transformed_X_test.city.map(city_to_rent_dict)
transformed_X_test['city'] = transformed_X_test['city'].fillna(UK_median_rent)
transformed_X_test['operator'] = transformed_X_test.operator.map(operator_to_premium_dict)
transformed_X_test['operator'] = transformed_X_test['operator'].fillna(UK_median_premium)
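The fallback behaviour can be illustrated in miniature. The encoding values below are the log-rent medians for London and Leeds and the UK median from the training output above, while 'Truro' stands in for a hypothetical city absent from the training set:

```python
import pandas as pd

# Illustrative training-set encodings (log-rent medians from above)
train_encoding = {'London': 5.96, 'Leeds': 5.46}
UK_fallback = 5.35  # the UK-wide median

# 'Truro' was not seen in training, so map() yields NaN, which fillna replaces
test_cities = pd.Series(['London', 'Truro'])
encoded = test_cities.map(train_encoding).fillna(UK_fallback)
print(encoded.tolist())  # [5.96, 5.35]
```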
Finally, we drop the asset_id column.
transformed_X_test.drop(columns = ['asset_id'], inplace = True)
We now have a fully transformed X_test and y_test, ready for modelling.
print(transformed_X_test.head())
print(transformed_y_test.head())
city operator beds age room_type_Non En-Suite \
171 5.587713 1.007143 4.465908 2.000000 0
9 5.587713 0.997017 5.111988 2.236068 0
1627 5.071879 0.999224 4.343805 3.162278 1
487 5.540324 0.994085 4.127134 1.414214 0
961 5.378205 0.989286 5.252273 2.236068 1
room_type_One Bed room_type_Studio
171 0 1
9 0 1
1627 0 0
487 0 1
961 0 0
171 5.877736
9 5.957132
1627 4.836282
487 5.926926
961 4.997212
Name: weekly_rent, dtype: float64
transformed_X_test = X_scaler.transform(transformed_X_test)
transformed_y_test = y_scaler.transform(transformed_y_test.to_numpy().reshape(-1, 1))
transformed_X_test = pd.DataFrame(transformed_X_test, columns = [
    'city',
    'operator',
    'beds',
    'age',
    'room_type_Non En-Suite',
    'room_type_One Bed',
    'room_type_Studio'
])
transformed_y_test = pd.Series(transformed_y_test.flatten(), name = 'weekly_rent')
print(transformed_X_test.head())
print(transformed_y_test.head())
       city  operator      beds       age  room_type_Non En-Suite  \
0  0.808906  0.552917 -1.141135 -0.827091               -0.303558
1  0.808906 -0.316809 -0.378176 -0.602575               -0.303558
2 -1.256070 -0.127228 -1.285327  0.278311                3.294264
3  0.619199 -0.568682 -1.541195 -1.384212               -0.303558
4 -0.029793 -0.980839 -0.212512 -0.602575                3.294264
   room_type_One Bed  room_type_Studio
0          -0.272166          1.062868
1          -0.272166          1.062868
2          -0.272166         -0.940851
3          -0.272166          1.062868
4          -0.272166         -0.940851
0    1.356933
1    1.572711
2   -1.473475
3    1.490619
4   -1.036107
Name: weekly_rent, dtype: float64
We now fit a Linear Regression model on our data and use evaluation metrics, such as the Mean Square Error ("MSE") and the $R^{2}$ score, to get a sense of the accuracy.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(transformed_X_train, transformed_y_train)
y_train_pred = model.predict(transformed_X_train)
y_test_pred = model.predict(transformed_X_test)
We write the below as a function to expedite our iterations of the model as we evaluate and compare different approaches.
from sklearn.metrics import mean_squared_error, r2_score
def model_metrics(y_train_pred, y_test_pred):
    train_mse = mean_squared_error(transformed_y_train, y_train_pred)
    train_r2 = r2_score(transformed_y_train, y_train_pred)
    test_mse = mean_squared_error(transformed_y_test, y_test_pred)
    test_r2 = r2_score(transformed_y_test, y_test_pred)
    # We then undo the standardisation on the two MSEs to allow for accurate comparison
    train_mse_original = train_mse * (y_scaler.scale_[0])**2
    test_mse_original = test_mse * (y_scaler.scale_[0])**2
    return train_mse_original, test_mse_original, train_r2, test_r2
train_mse, test_mse, train_r2, test_r2 = model_metrics(y_train_pred, y_test_pred)
print(f'Training MSE: {train_mse:.4f}')
print(f'Test MSE: {test_mse:.4f}')
print(f'Training R²: {train_r2:.4f}')
print(f'Test R²: {test_r2:.4f}')
Training MSE: 0.0197
Test MSE: 0.0241
Training R²: 0.8544
Test R²: 0.8280
Here we see that the MSE for the training set is only $0.0197$ to four decimal places, which is low and shows that the model is fitting well to the training data and has a high degree of accuracy. We note that the test data has an MSE of $0.0241$ to four decimal places, which is higher than that of the training set but still low. The MSE of the test set being close to that of the training set suggests that the model generalises well and is not overfit to the training data. We note that we expect the MSE to be higher for the test data than for the training data, as the model will always fit marginally better to the data it is trained upon, but the small gap is a positive sign.
An $R^{2}$ score of $0.8544$ and $0.8280$ for the training set and test set, respectively, is also a very positive sign. This suggests that ~$83\%$ of the variance in the weekly_rent variable on unseen data is explained by our independent variables in our model. The remaining ~$17\%$ is explained either by variables we do not have or by inherent variation. An $R^{2}$ score of ~$83\%$ is high and suggests that the Linear Regression model is indeed suitable, especially given the context of the data as being in the socially-influenced market of real estate rental rates, where factors such as marketing can play a significant role.
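As an aside, the rescaling of the standardised MSEs inside model_metrics (multiplying by the squared scale of the target scaler) can be sanity-checked on synthetic data: if $z = (y - \mu)/s$, then $(y - \hat{y})^2 = s^2(z - \hat{z})^2$. The data below is purely illustrative.

```python
import numpy as np

# Synthetic log-rents and noisy predictions, for illustration only
rng = np.random.default_rng(0)
y = rng.normal(5.3, 0.4, size=200)
y_hat = y + rng.normal(0.0, 0.15, size=200)

# Standardise both with the same mean and scale, as StandardScaler does
mean, scale = y.mean(), y.std()
z, z_hat = (y - mean) / scale, (y_hat - mean) / scale

mse_original = np.mean((y - y_hat) ** 2)
mse_standardised = np.mean((z - z_hat) ** 2)

# Undoing the standardisation recovers the MSE in log-rent units
print(np.isclose(mse_standardised * scale ** 2, mse_original))  # True
```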
We now consider the Root Mean Square Error ("RMSE") to derive further measures of accuracy. We note that we are considering both the RMSE and the values of weekly_rent in their log-transformed states, which reduces interpretability but is acceptable for the point of comparison.
def rmse_metrics(train_mse, test_mse):
    train_rmse = np.sqrt(train_mse)
    test_rmse = np.sqrt(test_mse)
    '''
    We now calculate the mean weekly_rent for both train and test as well as the range of weekly_rent in both data sets.
    Again, we remove the standardisation to ensure accuracy in comparison.
    '''
    mean_train_weekly_rent = (np.mean(transformed_y_train) * (y_scaler.scale_[0])) + y_scaler.mean_[0]
    mean_test_weekly_rent = (np.mean(transformed_y_test) * (y_scaler.scale_[0])) + y_scaler.mean_[0]
    range_train_weekly_rent = (np.max(transformed_y_train) - np.min(transformed_y_train)) * y_scaler.scale_[0]
    range_test_weekly_rent = (np.max(transformed_y_test) - np.min(transformed_y_test)) * y_scaler.scale_[0]
    # We now calculate some further metrics for comparison across models
    train_relative_average_error = train_rmse / mean_train_weekly_rent
    test_relative_average_error = test_rmse / mean_test_weekly_rent
    train_overall_variability = train_rmse / range_train_weekly_rent
    test_overall_variability = test_rmse / range_test_weekly_rent
    print(f'Training Relative Average Error: {train_relative_average_error:.2%}')
    print(f'Test Relative Average Error: {test_relative_average_error:.2%}')
    print(f'Training Proportion of Overall Variability: {train_overall_variability:.2%}')
    print(f'Test Proportion of Overall Variability: {test_overall_variability:.2%}')
rmse_metrics(train_mse, test_mse)
Training Relative Average Error: 2.61%
Test Relative Average Error: 2.89%
Training Proportion of Overall Variability: 6.20%
Test Proportion of Overall Variability: 7.61%
We note that the RMSE represents ~$2.61\%$ of the average weekly_rent for the training set and ~$2.89\%$ for the test set. This suggests that the relative average error of the two sets is quite low, again providing support for the efficacy of the model.
Furthermore, the RMSE as a proportion of the range of the weekly_rent variable is ~$6.20\%$ and ~$7.61\%$ for the training and test set, respectively. This suggests the RMSE is a small proportion of the target range, again suggesting the model performance is good.
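For intuition, the log-scale RMSE can be loosely translated back into pound terms. The sketch below is illustrative only: the scaler parameter is a hypothetical stand-in for the value held by our fitted y_scaler, and we assume the original transform was a natural log.

```python
import numpy as np

# Hypothetical stand-in (assumption for illustration only): in the notebook
# this would come from y_scaler.scale_[0].
scale_ = 0.45
test_rmse_std = np.sqrt(0.0244)  # test MSE reported above, on the standardised scale

# Undo the standardisation to express the RMSE in natural-log-rent units; an
# error of r on the log scale is roughly a (e^r - 1) relative error in pounds.
rmse_log = test_rmse_std * scale_
relative_error = np.expm1(rmse_log)
print(f'RMSE in log units: {rmse_log:.4f} (~{relative_error:.1%} relative error in £ terms)')
```

This is a rough heuristic rather than an exact back-transformation, but it helps anchor the log-scale figures in familiar units.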
We now consider the residuals of the model.
train_residuals = transformed_y_train - y_train_pred
def residual_plot(residuals, predictions):
    plt.clf()
    fig, ax = plt.subplots(figsize = (10, 6))
    ax.scatter(predictions, residuals, color = (0, 0.13, 0.27), alpha = 0.8)
    ax.axhline(y = 0, color = (0.95, 0.65, 0.07), ls = '--')
    ax.set_xlabel('Predicted Values', fontweight = 'medium')
    ax.set_ylabel('Residuals', fontweight = 'medium')
    ax.set_title('Residual Plot', fontweight = 'semibold')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.show()
residual_plot(train_residuals, y_train_pred)
As we can see, the residuals are fairly random and scattered around zero, as we would expect. The residuals appear to be fairly homoscedastic with approximately equal variance in the residuals as we move up and down the x-axis. We note that there are some outliers, for example the two points in the top right and some points near the bottom, which we may decide to rectify later.
We now consider the distribution of the residuals below.
def residuals_hist(residuals, bins = None):
    plt.clf()
    fig, ax = plt.subplots(figsize = (10, 6))
    if bins is None:
        sns.histplot(residuals, kde = True, color = (0, 0.13, 0.27))
    else:
        sns.histplot(residuals, kde = True, color = (0, 0.13, 0.27), bins = bins)
    ax.set_title("Residuals Distribution", fontweight = "semibold")
    ax.set_xlabel("Residuals", fontweight = "medium")
    ax.set_ylabel("Count", fontweight = "medium")
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.show()
residuals_hist(train_residuals)
As we can see, the residuals are normally distributed around zero, albeit we note there are some outliers in the right and left tails. We examine the QQ plot of the residuals below.
import scipy.stats as stats
def residuals_qq(residuals):
    plt.clf()
    fig, ax = plt.subplots(figsize = (10, 6))
    stats.probplot(residuals, dist = 'norm', plot = ax)
    ax.get_lines()[0].set_markeredgecolor((0, 0.13, 0.27))
    ax.get_lines()[0].set_markerfacecolor((0, 0.13, 0.27))
    ax.get_lines()[1].set_color((0.95, 0.65, 0.07))
    ax.get_lines()[1].set_linestyle('--')
    ax.set_title('Residuals QQ Plot', fontweight = 'semibold')
    ax.set_xlabel('Theoretical Quantiles', fontweight = 'medium')
    ax.set_ylabel('Ordered Values', fontweight = 'medium')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.show()
residuals_qq(train_residuals)
We can see that there is a strong linear pattern but note that there are two outliers towards the upper right of the plot as well as some outliers towards the left tail.
We will investigate the outliers further to decide whether they need to be removed or not. We start by calculating the absolute standardised residual and considering those with a score above three as outliers.
residuals = pd.DataFrame(train_residuals)
residuals.rename(columns = {
    'weekly_rent' : 'residuals'
}, inplace = True)
res_std_dev = residuals.residuals.std()
residuals['standardised'] = residuals.residuals.apply(lambda x: x / res_std_dev).abs()
outlier_indices = residuals[residuals['standardised'] >= 3].index.to_list()
print(residuals.loc[outlier_indices])
     residuals  standardised
563  -1.201685      3.147767
659  -1.175193      3.078373
684   1.437216      3.764732
813   1.433033      3.753776
There are four residuals which, when standardised, pass our threshold for being considered outliers. To examine these further, we retrieve the corresponding rows from the original training set, prior to all our transformations.
X_train_reset_index = X_train.reset_index(drop = True)
outlier_features = X_train_reset_index.loc[outlier_indices]
y_train_reset_index = y_train.reset_index(drop = True)
outlier_rents = y_train_reset_index.loc[outlier_indices]
outliers = outlier_features.copy()
outliers['weekly_rent'] = outlier_rents
print(outliers)
asset_id city operator beds build_date \
563 836 Carlisle Unest 50 2016
659 908 Huddersfield iQ Student Accommodation 653 2014
684 917 London iQ Student Accommodation 420 2013
813 903 London iQ Student Accommodation 171 2014
room_type weekly_rent
563 Non En-Suite 80.0
659 En-Suite 102.5
684 En-Suite 632.0
813 En-Suite 627.5
As we can see, there are four outliers, identified through both the residual plots and the standardised residual test. Immediately, we see that iQ-operated assets comprise three of the four. This is unsurprising: iQ has one of the largest bed counts in the country and operates across the vast majority of UK markets, so it achieves a very wide range of weekly_rents, which has likely made the influence of the operator feature harder for the model to capture.
The first outlier is for the non en-suite rooms in the Unest asset in Carlisle. We note that the data is accurate and that this outlier status is likely a result of Carlisle having an incredibly low supply of PBSA, on account of there being very little demand for PBSA there. Furthermore, due to Carlisle's inherently low-value real estate market, rents for any PBSA asset in the city will be low in comparison to other cities, placing it at one extreme of the national market. This, too, is likely to contribute to its outlier status. We are not concerned with the data here and leave this entry as it is.
The second outlier corresponds to the en-suite rooms in iQ Castings in Huddersfield. We note that this has been verified as an accurate price and is in line with iQ's other Huddersfield asset, Little Aspley House. We leave this as it is.
The final two outliers correspond to the en-suite room type in two different iQ assets in London. We note that these two assets are iQ Bloomsbury and iQ Hammersmith, respectively. Both assets are in exceptional locations in London, notably Bloomsbury, which is directly adjacent to numerous universities. Given this, these rates, whilst high, are indeed accurate and a byproduct of the extremely desirable sub-markets within London in which these assets are situated. We will therefore leave these two data points in the data set.
We now consider the standardised coefficients of the Linear Regression model to evaluate which factors are having a significant effect on the model and which are not.
coefficients = model.coef_
intercept = model.intercept_
feature_names = transformed_X_train.columns
coef_df = pd.DataFrame({
    "feature": feature_names,
    "coefficient": coefficients
})
print(f'The intercept is {intercept:.4f}.')
for feature in coef_df.feature.tolist():
    print(f'The {feature} feature has a coefficient of {coef_df[coef_df["feature"] == feature]["coefficient"].values[0]:.4f}.')
The intercept is -0.0000.
The city feature has a coefficient of 0.7124.
The operator feature has a coefficient of 0.2278.
The beds feature has a coefficient of 0.0286.
The age feature has a coefficient of -0.1029.
The room_type_Non En-Suite feature has a coefficient of -0.0598.
The room_type_One Bed feature has a coefficient of 0.3185.
The room_type_Studio feature has a coefficient of 0.3565.
As we can see, the city feature is clearly the most significant, given it has the largest absolute coefficient. We note that beds and room_type_Non En-Suite are the least significant.
City being the most important is as we expected. Furthermore, age and room_type_Non En-Suite having negative coefficients makes sense, given the 'base' case (a zero in all the OHE room_type columns) is an en-suite and we would expect a non en-suite to be cheaper than an en-suite, were all else fixed.
Below, we consider a correlation matrix for all the features and the dependent weekly_rent variable, to assess which independent variables have a strong linear correlation with weekly_rent and to check for multicollinearity.
corr_data = transformed_X_train.copy()
corr_data['weekly_rent'] = transformed_y_train
corr_train_matrix = corr_data.corr()
plt.clf()
fig, ax = plt.subplots(figsize = (10, 6))
sns.heatmap(corr_train_matrix, annot = True, cmap = custom_palette, fmt = '.2f')
ax.set_title('Correlation Heatmap of Variables', fontweight = 'semibold')
plt.show()
We note that, as above, city has the largest correlation with weekly_rent. Furthermore, no pair of independent variables shows notable multicollinearity, the largest inter-feature correlation being $-0.29$, between room_type_Studio and room_type_Non En-Suite.
Given the above correlation plot, we will not consider removing any features due to multicollinearity. However, we will consider removing both the beds and room_type_Non En-Suite feature given both their small absolute coefficients and low correlations with weekly_rent to see if that improves the model via simplification.
We start by removing the beds feature.
transformed_X_train_no_beds = transformed_X_train.drop(columns = ['beds'])
transformed_X_test_no_beds = transformed_X_test.drop(columns = ['beds'])
We now train a new model on this refined data set before evaluating it using the metrics we used previously to decide if it is a more accurate model or not.
model_no_beds = LinearRegression()
model_no_beds.fit(transformed_X_train_no_beds, transformed_y_train)
y_train_pred_no_beds = model_no_beds.predict(transformed_X_train_no_beds)
y_test_pred_no_beds = model_no_beds.predict(transformed_X_test_no_beds)
train_mse, test_mse, train_r2, test_r2 = model_metrics(y_train_pred_no_beds, y_test_pred_no_beds)
print(f'Training MSE: {train_mse:.4f}')
print(f'Test MSE: {test_mse:.4f}')
print(f'Training R²: {train_r2:.4f}')
print(f'Test R²: {test_r2:.4f}')
Training MSE: 0.0198
Test MSE: 0.0244
Training R²: 0.8536
Test R²: 0.8257
We note that this is a very similar outcome to what we had before, showing that beds was having very little impact. The model is marginally inferior to the original, with a slightly higher MSE and a lower $R^{2}$ score, but still generalises well.
We now consider a model with just the room_type_Non En-Suite feature removed.
transformed_X_train_no_NES = transformed_X_train.drop(columns = ['room_type_Non En-Suite'])
transformed_X_test_no_NES = transformed_X_test.drop(columns = ['room_type_Non En-Suite'])
model_no_NES = LinearRegression()
model_no_NES.fit(transformed_X_train_no_NES, transformed_y_train)
y_train_pred_no_NES = model_no_NES.predict(transformed_X_train_no_NES)
y_test_pred_no_NES = model_no_NES.predict(transformed_X_test_no_NES)
train_mse, test_mse, train_r2, test_r2 = model_metrics(y_train_pred_no_NES, y_test_pred_no_NES)
print(f'Training MSE: {train_mse:.4f}')
print(f'Test MSE: {test_mse:.4f}')
print(f'Training R²: {train_r2:.4f}')
print(f'Test R²: {test_r2:.4f}')
Training MSE: 0.0201
Test MSE: 0.0245
Training R²: 0.8513
Test R²: 0.8254
This model is also marginally inferior to the original: its training metrics are worse, and its test MSE and $R^{2}$ are marginally worse, although it still generalises well. We would take the original Linear Regression model over both of these iterations due to its superior performance.
Other refinements we could employ include looking at regularisation techniques such as ridge or lasso. However, we note that given the model currently generalises well to the test data, as evidenced by the comparable $R^{2}$ scores between the train and test sets, as well as the low dimensionality given the Target Encoding we employed, alongside the lack of multicollinearity, regularisation is unlikely to improve the model and therefore we will not explore the idea in this study.
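Were we to check this, the comparison would be inexpensive to run. The sketch below illustrates the idea on synthetic stand-in data (X_demo and y_demo are assumptions, standing in for our transformed training set), rather than on our actual data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for our standardised training data (assumption: seven
# features, as in transformed_X_train).
rng = np.random.default_rng(14)
X_demo = rng.standard_normal((500, 7))
y_demo = X_demo @ rng.standard_normal(7) + 0.1 * rng.standard_normal(500)

# Compare cross-validated R² for ridge (L2) and lasso (L1) shrinkage; if neither
# beats plain Linear Regression, regularisation is unlikely to help.
for name, est in [('Ridge', Ridge(alpha = 1.0)), ('Lasso', Lasso(alpha = 0.01))]:
    scores = cross_val_score(est, X_demo, y_demo, cv = 5, scoring = 'r2')
    print(f'{name} mean CV R²: {scores.mean():.4f}')
```

In practice one would also tune alpha, for example via GridSearchCV, before drawing any conclusion.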
We now wish to consider other potential regression models to see if they perform better on the data. We note that the residual analysis above suggests the data is fairly homoscedastic and appears to have normally distributed residuals, which is why the Linear Regression model performed well.
However, we wish to see whether more complex models that can handle non-linear relationships can better model the data. In the next two sections we shall consider two further models - Random Forest and Gradient Boosting. We will compare the models' accuracies and eventually choose one to take forward for refinement.
We start with a Random Forest model. Given that the Linear Regression model we are using is the original model, we will compare to that one by training the Random Forest model on that data set.
We note that whilst a Random Forest model does not require standardised or normally distributed data, it can work with such data perfectly well, and so we refrain from re-transforming the data following our earlier transformations.
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators = 200, random_state = 14)
rf_model.fit(transformed_X_train, transformed_y_train)
rf_y_train_pred = rf_model.predict(transformed_X_train)
rf_y_test_pred = rf_model.predict(transformed_X_test)
train_mse, test_mse, train_r2, test_r2 = model_metrics(rf_y_train_pred, rf_y_test_pred)
print(f'Training MSE: {train_mse:.4f}')
print(f'Test MSE: {test_mse:.4f}')
print(f'Training R²: {train_r2:.4f}')
print(f'Test R²: {test_r2:.4f}')
Training MSE: 0.0025
Test MSE: 0.0226
Training R²: 0.9816
Test R²: 0.8386
As we can see, the MSE on both the training and test sets is lower than that of the Linear Regression model. Moreover, with $R^{2}$ scores of $0.9816$ and $0.8386$ for the training and test sets, respectively, this model appears to be explaining the variability in the data to a higher degree than the Linear Regression model. However, the larger drop off between the training set $R^{2}$ score and the test set $R^{2}$ score suggests that this model is overfit to the training data and is not generalising as well to the unseen test data. We may be able to remedy this with some parameter tuning.
First we examine how significant each feature is in this model.
def feature_importance(model, X_train):
    importances = model.feature_importances_
    indices = np.argsort(importances)[::-1]
    plt.clf()
    fig, ax = plt.subplots(figsize = (10, 6))
    ax.bar(range(X_train.shape[1]), importances[indices], align = "center", color = (0, 0.13, 0.27))
    ax.set_title('Feature Importance', fontweight = 'semibold')
    ax.set_xlabel('Feature', fontweight = 'medium')
    ax.set_ylabel('Importance', fontweight = 'medium')
    plt.xticks(range(X_train.shape[1]), X_train.columns[indices], rotation = 90)
    ax.set_xlim([-1, X_train.shape[1]])
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.show()
feature_importance(rf_model, transformed_X_train)
Again, similarly to the Linear Regression model, we see that the city variable is by far the most important. Furthermore, we note that room_type_Non En-Suite is the least important. Perhaps a Random Forest model without this feature would be an improvement.
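Such a check is quick to run. The sketch below illustrates the pattern on synthetic stand-in data (generated via make_regression, an assumption in place of our transformed sets): fit the forest, drop the least important feature, and refit.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the notebook this would be transformed_X_train
# with the room_type_Non En-Suite column dropped instead.
X, y = make_regression(n_samples = 600, n_features = 7, n_informative = 5,
                       noise = 5.0, random_state = 14)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state = 14)

# Fit on all features, identify the least important one, then refit without it.
rf = RandomForestRegressor(n_estimators = 200, random_state = 14).fit(X_tr, y_tr)
weakest = int(np.argmin(rf.feature_importances_))
rf_reduced = RandomForestRegressor(n_estimators = 200, random_state = 14)
rf_reduced.fit(np.delete(X_tr, weakest, axis = 1), y_tr)
print(f'Full-feature test R²:    {rf.score(X_te, y_te):.4f}')
print(f'Reduced-feature test R²: {rf_reduced.score(np.delete(X_te, weakest, axis = 1), y_te):.4f}')
```

Comparing the two test scores tells us whether the weakest feature is pulling its weight.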
We now consider the residuals of this model. We note that, as a non-linear model, we do not require homoscedasticity, normally distributed residuals, or a linear QQ plot. However, we examine these residuals for completeness and as a point of comparison.
rf_train_residuals = transformed_y_train - rf_y_train_pred
residual_plot(rf_train_residuals, rf_y_train_pred)
Whilst the residuals are centred around zero, there does appear to be slight heteroscedasticity here, with the variance of the residuals growing as we move along the x-axis, and a slight pattern whereby the residuals increase as the predicted values do.
Below we plot the distribution of the residuals.
residuals_hist(rf_train_residuals)
We note that the residuals appear to be broadly normally distributed albeit less so than the residuals for the Linear Regression. Furthermore, there is a rough symmetry here that suggests little bias.
Below we plot the QQ plot for the residuals from this model.
residuals_qq(rf_train_residuals)
Here we see that whilst there is a linear relationship in the main, suggesting a degree of normality, the tails deviate at both ends. This suggests the tails of the distribution are fatter than that of a normal distribution, unlike when we considered the residuals from the Linear Regression model, which were broadly normal.
We now consider another non-parametric model, Gradient Boosting, to see if we can garner superior results to those of the Random Forest. Again, for the purpose of comparison, we train this model on the original data set.
from sklearn.ensemble import GradientBoostingRegressor
gb_model = GradientBoostingRegressor(n_estimators = 200, random_state = 14)
gb_model.fit(transformed_X_train, transformed_y_train)
gb_y_train_pred = gb_model.predict(transformed_X_train)
gb_y_test_pred = gb_model.predict(transformed_X_test)
train_mse, test_mse, train_r2, test_r2 = model_metrics(gb_y_train_pred, gb_y_test_pred)
print(f'Training MSE: {train_mse:.4f}')
print(f'Test MSE: {test_mse:.4f}')
print(f'Training R²: {train_r2:.4f}')
print(f'Test R²: {test_r2:.4f}')
Training MSE: 0.0096
Test MSE: 0.0203
Training R²: 0.9287
Test R²: 0.8550
We note that this model also appears to outperform the Linear Regression model. Furthermore, the $R^{2}$ score for the training data is lower for the Gradient Boosting model, at $0.9287$, than for the Random Forest model, at $0.9816$, whilst the $R^{2}$ score for the test data is notably superior, at $0.8550$ in comparison to $0.8386$. This suggests that the Gradient Boosting model generalises better to unseen data, as there is a smaller drop off, whilst improving accuracy on the test data.
That being said, we should still note that there is a drop off, albeit smaller, between the $R^{2}$ score for the training data and the $R^{2}$ score for the test data, suggesting once again that there is room for improvement to prevent the apparent overfitting to the training set.
We now consider the significance of each feature in this model.
feature_importance(gb_model, transformed_X_train)
Again, we can see that city is by far the most important feature and that room_type_Non En-Suite is the least. Interestingly, in comparison to the Random Forest Model, the beds feature has a smaller significance in this model and is more comparable to room_type_Non En-Suite.
We now consider the residuals of this model, again noting that the stipulations required for the Linear Regression residuals do not apply here. We include them as a point of comparison.
gb_train_residuals = transformed_y_train - gb_y_train_pred
residual_plot(gb_train_residuals, gb_y_train_pred)
Here the residuals appear to be fairly homoscedastic albeit there are some outliers which may affect the histogram and QQ plots. We now consider the histogram of the residuals.
residuals_hist(gb_train_residuals)
As expected, given the residuals plot above, the histogram shows a fairly normal distribution, albeit with fatter tails. We now consider the QQ plot for these residuals.
residuals_qq(gb_train_residuals)
Again we have a broadly linear pattern, with similar deviations at the tails to the Random Forest model. This suggests that the Gradient Boosting residuals also have slightly fatter tails than a normal distribution.
The Gradient Boosting model is superior from an $R^{2}$ and MSE perspective to the Random Forest model. Furthermore, the Gradient Boosting model experiences a smaller drop off from training $R^{2}$ to test $R^{2}$, suggesting a better baseline generalisation than the Random Forest model. Although the Linear Regression model is simpler and shows an even better generalisation between the train and test sets, its $R^{2}$ score is inferior to both the Random Forest and Gradient Boosting models, and therefore we have elected to select the Gradient Boosting model moving forward. Given this, we now want to consider tuning the model to see if we can achieve marginally better performance.
Below, we use Grid Search to seek the best hyperparameter values for the model, so we can train a superior and more accurate model.
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'subsample': [0.8, 0.9, 1.0]
}
grid_gb_model = GradientBoostingRegressor(random_state = 14)
grid_search = GridSearchCV(grid_gb_model, param_grid, cv = 5, n_jobs = -1, verbose = 1)
grid_search.fit(transformed_X_train, transformed_y_train)
best_params = grid_search.best_params_
print(f"The best parameters are as follows: {best_params}")
print(f"This gives a best cross-validation score of {grid_search.best_score_:.4f}")
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
The best parameters are as follows: {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 200, 'subsample': 0.9}
This gives a best cross-validation score of 0.8911
optim_gb_model = GradientBoostingRegressor(**best_params, random_state = 14)
optim_gb_model.fit(transformed_X_train, transformed_y_train)
optim_gb_y_train_pred = optim_gb_model.predict(transformed_X_train)
optim_gb_y_test_pred = optim_gb_model.predict(transformed_X_test)
train_mse, test_mse, train_r2, test_r2 = model_metrics(optim_gb_y_train_pred, optim_gb_y_test_pred)
print(f'Training MSE: {train_mse:.4f}')
print(f'Test MSE: {test_mse:.4f}')
print(f'Training R²: {train_r2:.4f}')
print(f'Test R²: {test_r2:.4f}')
Training MSE: 0.0025
Test MSE: 0.0177
Training R²: 0.9815
Test R²: 0.8740
We can see that the model has a low MSE on both the training and test sets. Moreover, the $R^{2}$ score on both sets has improved. However, we note that the drop off between the two is larger, which suggests the model is overfitting to the training data.
We now use Cross Validation to check the robustness of the model by training it on five different subsamples of the transformed_X_train data set and then testing it using the $R^{2}$ score.
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(optim_gb_model, transformed_X_train, transformed_y_train, cv = 5, scoring = 'r2')
print('Cross-Validation R² Scores')
for i, score in enumerate(cv_scores):
    print(f'Iteration {i + 1}: {score:.4f}')
print(f'Mean CV R²: {cv_scores.mean():.4f}')
Cross-Validation R² Scores
Iteration 1: 0.9134
Iteration 2: 0.8919
Iteration 3: 0.8925
Iteration 4: 0.8845
Iteration 5: 0.8734
Mean CV R²: 0.8911
Again, we note there is a drop from the mean cross-validation $R^{2}$ score of $0.8911$ to the test $R^{2}$ score of $0.8740$, further supporting the idea that the model is overfitting, albeit not a substantial drop off. Gradient Boosting models are, however, known for their complexity, and this overfitting may simply be a result of that.
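One way to watch this overfitting develop is to track the test error after each boosting stage via staged_predict. A minimal sketch on synthetic stand-in data (an assumption in place of our transformed sets):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the notebook this would be the transformed sets.
X, y = make_regression(n_samples = 600, n_features = 7, n_informative = 5,
                       noise = 5.0, random_state = 14)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state = 14)

gb = GradientBoostingRegressor(n_estimators = 200, max_depth = 5, random_state = 14)
gb.fit(X_tr, y_tr)

# staged_predict yields predictions after each boosting stage, so we can see
# where test error bottoms out while training error keeps falling.
test_errors = [mean_squared_error(y_te, pred) for pred in gb.staged_predict(X_te)]
best_stage = int(np.argmin(test_errors)) + 1
print(f'Test MSE is minimised at stage {best_stage} of {gb.n_estimators}')
```

If the minimising stage is well before the final one, the later stages are fitting noise.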
Below, we attempt to combat this. We train another Gradient Boosting model where we reduce max_depth in order to reduce complexity, whilst simultaneously reducing the learning_rate to avoid overfitting and increasing n_estimators to compensate and stabilise the model.
optim_gb_model = GradientBoostingRegressor(learning_rate = 0.03,
                                           max_depth = 3,
                                           n_estimators = 400,
                                           min_samples_split = 2,
                                           subsample = 0.8,
                                           random_state = 14
                                           )
optim_gb_model.fit(transformed_X_train, transformed_y_train)
optim_gb_y_train_pred = optim_gb_model.predict(transformed_X_train)
optim_gb_y_test_pred = optim_gb_model.predict(transformed_X_test)
train_mse, test_mse, train_r2, test_r2 = model_metrics(optim_gb_y_train_pred, optim_gb_y_test_pred)
print(f'Training MSE: {train_mse:.4f}')
print(f'Test MSE: {test_mse:.4f}')
print(f'Training R²: {train_r2:.4f}')
print(f'Test R²: {test_r2:.4f}')
Training MSE: 0.0111
Test MSE: 0.0204
Training R²: 0.9178
Test R²: 0.8545
As we can see, we have sacrificed some accuracy in both the MSE and $R^{2}$ scores with this version of the model. However, the difference between the training and test $R^{2}$ scores is approximately $0.0633$: significantly less than before, albeit more than twice the drop off of the original Linear Regression model. This model has higher overall metrics than the Linear Regression model, and a drop off of $0.0633$ is broadly acceptable. Furthermore, it achieves a similar test $R^{2}$ score to that of the original Gradient Boosting model, yet benefits from better generalisation. We have therefore found a model with decent generalisation and a test $R^{2}$ score of ~$85\%$.
cv_scores = cross_val_score(optim_gb_model, transformed_X_train, transformed_y_train, cv = 5, scoring = 'r2')
print('Cross-Validation R² Scores')
for i, score in enumerate(cv_scores):
    print(f'Iteration {i + 1}: {score:.4f}')
print(f'Mean CV R²: {cv_scores.mean():.4f}')
Cross-Validation R² Scores
Iteration 1: 0.9012
Iteration 2: 0.8812
Iteration 3: 0.8760
Iteration 4: 0.8707
Iteration 5: 0.8637
Mean CV R²: 0.8786
We note as well that the average cross-validation $R^{2}$ score is closer to the one we are achieving on the test set, further exemplifying this model's ability to generalise.
As well as optimising our model, we could also use the Ensemble Method by averaging the predictions of our optimised Gradient Boosting model and the original Linear Regression model in an attempt to attain both a high test $R^{2}$ score whilst maintaining a good level of generalisation.
ensemble_y_train_pred = (optim_gb_y_train_pred + y_train_pred) / 2
ensemble_y_test_pred = (optim_gb_y_test_pred + y_test_pred) / 2
train_mse, test_mse, train_r2, test_r2 = model_metrics(ensemble_y_train_pred, ensemble_y_test_pred)
print(f'Training MSE: {train_mse:.4f}')
print(f'Test MSE: {test_mse:.4f}')
print(f'Training R²: {train_r2:.4f}')
print(f'Test R²: {test_r2:.4f}')
Training MSE: 0.0143
Test MSE: 0.0211
Training R²: 0.8941
Test R²: 0.8494
As we can see, this model maintains the overall test $R^{2}$ score around ~$85\%$, and also has superior generalisation, with the model being less overfit to the training set, as exemplified by the smaller gap between the training and test $R^{2}$ score of around $0.0447$. Moreover, we note that the MSE for the test set is not too inferior for this ensemble model in comparison to our optimised Gradient Boosting model.
We now consider the two models. One is our optimised Gradient Boosting model, whilst the other is our ensemble model, which averages the high-scoring optimised Gradient Boosting model with the lower-scoring Linear Regression model, the latter being better at generalising to unseen data and less prone to overfitting.
Whilst using a single model is simpler, averaging the two has a logical foundation: the Gradient Boosting model contributes the ability to capture non-linear effects, whilst the Linear Regression model generalises better and is well suited to the homoscedasticity of the data. Given this combination of a model that is prone to overfitting but can capture non-linear trends with a linear model that generalises well and reflects the data, we are going to proceed with the ensemble model.
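As an aside, this manual averaging can be packaged into a single refittable estimator with scikit-learn's VotingRegressor, which by default takes an unweighted mean of its members' predictions. A sketch on synthetic stand-in data (the data and hyperparameter values are assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the notebook the transformed sets would be used.
X, y = make_regression(n_samples = 600, n_features = 7, n_informative = 5,
                       noise = 5.0, random_state = 14)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state = 14)

# Mirrors our manual (gb_pred + lr_pred) / 2 averaging in one estimator.
ensemble = VotingRegressor([
    ('lr', LinearRegression()),
    ('gb', GradientBoostingRegressor(learning_rate = 0.03, max_depth = 3,
                                     n_estimators = 400, subsample = 0.8,
                                     random_state = 14)),
])
ensemble.fit(X_tr, y_tr)
print(f'Ensemble test R²: {ensemble.score(X_te, y_te):.4f}')
```

Wrapping the ensemble this way also makes it compatible with cross_val_score and GridSearchCV.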
For completeness, we examine the residuals, the distribution of the residuals, and a QQ plot.
ensemble_train_residuals = transformed_y_train - ensemble_y_train_pred
residual_plot(ensemble_train_residuals, ensemble_y_train_pred)
Here, as expected, we see a fairly good pattern, with the residuals scattered around zero, signs of homoscedasticity, and the two outliers in the top right explained previously.
residuals_hist(ensemble_train_residuals)
As expected, the residuals are broadly normal with a symmetry suggesting zero bias.
residuals_qq(ensemble_train_residuals)
The QQ plot shows good signs of normality, given we are incorporating a linear model in our ensemble. We note the two obvious outliers in the right tail have been accounted for and there is a slight skew in the left tail but it is not too egregious.
In this section, we have trained various different models on the transformed data. We began by considering a simple Linear Regression model which proved to be accurate and generalise well. We then considered non-parametric models like the Random Forest and Gradient Boosting. Both provided stronger scores than the Linear Regression model but were clearly prone to overfitting.
After initially deciding to pursue the Gradient Boosting model, we refined the hyperparameters using Grid Search to arrive at a version of the model with a higher $R^{2}$ score, albeit at the expense of generality; it was clear the model was overfitting. In defining a new model with different hyperparameters, we were able to reduce this overfitting and increase the generality, at the cost of a reduced $R^{2}$ score.
We then considered the Ensemble Method, which involves averaging the predictions of multiple models to smooth out any outliers or overfitting. We took both the Linear Regression model and our optimised and high $R^{2}$-scoring Gradient Boosting model and averaged the two. The result was a model with a similar $R^{2}$ score and MSE to the optimised Gradient Boosting model but with a more logical and interpretable origin as well as superior generalisation. Given this, we decided to use this model going forward.
In this section, we shall evaluate the chosen model across a range of metrics. We will then interpret both the residuals and importance of features for the model within the real-world context of PBSA rental rates. Finally, we examine what this has told us about the PBSA rental market in the UK in a wider context.
We chose to move forward with an ensemble model that averaged the predictions of a simple Linear Regression model and the more complex Gradient Boosting model. The reasoning for this was that the two models combined provide a degree of robustness, given that the Linear Regression model is less prone to overfitting whilst the Gradient Boosting model can capture non-linear trends more accurately.
Despite our data proving to be fairly normal and the Linear Regression model performing well, we found that the Gradient Boosting model performed better from an $R^{2}$ perspective but was prone to overfitting. Taking an average of the two therefore provides a higher-performing model that generalises better than the Gradient Boosting model alone, which we deemed worth the expense of losing the high level of interpretability afforded by a Linear Regression model on its own.
Below we test the ensemble model by considering the MSE, $R^{2}$ score, and the RMSE as a proportion of the variance found in the data set.
train_mse, test_mse, train_r2, test_r2 = model_metrics(ensemble_y_train_pred, ensemble_y_test_pred)
print(f'Training MSE: {train_mse:.4f}')
print(f'Test MSE: {test_mse:.4f}')
print(f'Training R²: {train_r2:.4f}')
print(f'Test R²: {test_r2:.4f}')
Training MSE: 0.0143
Test MSE: 0.0211
Training R²: 0.8941
Test R²: 0.8494
The MSE for the training set is $0.0143$, which is low and shows that the model is fitting well to the training data with a high degree of accuracy. In comparison, the test set data has an MSE of $0.0211$ to four decimal places, which is higher than the training set, as we would expect, but still low. Moreover, the two MSEs being fairly close suggests that the model generalises well and is not overfit to the training data.
Furthermore, the training data has an $R^{2}$ score of $0.8941$, whilst the test set score is $0.8494$. This suggests that ~$85\%$ of the variance in the weekly_rent variable on the test set is explained by the independent variables in our model. The remaining ~$15\%$ may be explained by inherent variation in the variable or by features we do not have data for in our dataset. We note that an $R^{2}$ score of ~$85\%$ is high and suggests that the ensemble model is performing well. Furthermore, real estate rents are typically socially influenced, with specific areas moving in and out of popularity as time progresses, and unaccounted-for aspects, such as marketing and curb appeal, also have an effect. Given this, attaining an $R^{2}$ score of $90\%$ or more is unlikely on such data, given the inherent variation we discussed.
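For readers wanting the definition behind these numbers, $R^{2} = 1 - SS_{res}/SS_{tot}$: it compares the model's squared error to that of a naive mean-only predictor. A minimal sketch, equivalent to `sklearn.metrics.r2_score`:

```python
import numpy as np

# R² compares the model's residual sum of squares to the total sum of squares
# of a mean-only baseline; ~0.85 means ~85% of the variance in the
# (transformed) weekly_rent is explained by the model.
def r2_score_manual(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot
```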
We now examine the RMSE.
rmse_metrics(train_mse, test_mse)
Training Relative Average Error: 2.23%
Test Relative Average Error: 2.70%
Training Proportion of Overall Variability: 5.29%
Test Proportion of Overall Variability: 7.12%
Above we see that the RMSE represents ~$2.23\%$ of the average weekly_rent for the training data and ~$2.70\%$ for the test data. This is a low relative average error, and lower than that of the Linear Regression model alone, suggesting that the model is performing well.
We also note that the RMSE represents ~$5.29\%$ of the overall variability of the weekly_rent variable for the training data and ~$7.12\%$ for the test data. Again, this is a small proportion of the target's spread, suggesting the model's performance is strong.
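The `rmse_metrics` helper was defined earlier; as a hedged sketch of what such relative metrics compute (the exact denominator behind 'Proportion of Overall Variability' is an assumption here — we take the target's range):

```python
import numpy as np

# A sketch of RMSE expressed in relative terms. Assumes `mse` is on the same
# (transformed) scale as the target `y`; using the target's range as the
# 'overall variability' denominator is an assumption, not the notebook's own code.
def relative_rmse(mse, y):
    y = np.asarray(y)
    rmse = np.sqrt(mse)
    pct_of_mean = 100 * rmse / y.mean()              # relative average error
    pct_of_range = 100 * rmse / (y.max() - y.min())  # share of overall spread
    return pct_of_mean, pct_of_range
```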
We now examine the residuals of the test data for the ensemble model.
ensemble_test_residuals = transformed_y_test - ensemble_y_test_pred
residual_plot(ensemble_test_residuals, ensemble_y_test_pred)
We note that since we are averaging out the Linear Regression model with a model that does not require homoscedasticity, this condition is less significant. However, for completeness, these residuals appear to show some signs of 'fanning out' as we move along the x-axis, which is a sign of heteroscedasticity.
We now test to see if there are any outliers in this test set, as measured by our test earlier. Again we standardise the absolute values of the residuals and compare them to our condition of three.
test_residuals = pd.DataFrame(ensemble_test_residuals)
test_residuals.rename(columns = {
    'weekly_rent' : 'residuals'
}, inplace = True)

# Standardise the absolute residuals and flag any at or beyond three standard deviations
test_res_std_dev = test_residuals.residuals.std()
test_residuals['standardised'] = (test_residuals.residuals / test_res_std_dev).abs()
test_outlier_indices = test_residuals[test_residuals['standardised'] >= 3].index.to_list()

if len(test_outlier_indices) == 0:
    print('There are no observed outliers.')
else:
    print(test_residuals.loc[test_outlier_indices])
There are no observed outliers.
Here we see that there are no observed outliers in the test set under this model. We now consider the distribution of the residuals and the QQ plot.
residuals_hist(ensemble_test_residuals, bins = 15)
We note that the distribution has a fatter left tail than a normal distribution, suggesting a slight negative skew. Again, normality is not so important here, given that we are averaging the Linear Regression model with a non-linear model.
We examine the QQ plot below.
residuals_qq(ensemble_test_residuals)
As the histogram suggested, the deviation at both the right and left tail below the line suggests the negative skew we saw. We note that it is broadly normal for the majority of the quantiles, however.
This slight negative skew at the tails suggests that the model is prone to over-predicting the value of weekly_rent.
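The visual impression of skew can be quantified with the sample skewness (`scipy.stats.skew` gives the same statistic); a numpy-only version, which could be applied directly to `ensemble_test_residuals`:

```python
import numpy as np

# Sample skewness: the third standardised moment. Negative values indicate a
# longer left tail, matching the over-prediction pattern seen in the histogram.
def sample_skewness(values):
    values = np.asarray(values, dtype = float)
    centred = values - values.mean()
    return np.mean(centred ** 3) / values.std() ** 3
```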
We now consider the importance of each feature. We examine the coefficients of the Linear Regression model that forms part of the ensemble model, as well as considering the feature importance for the optimised Gradient Boosting model, which forms the other part of our ensemble model.
coefficients = model.coef_
intercept = model.intercept_
feature_names = transformed_X_train.columns

# Pair each feature with its fitted coefficient
coef_df = pd.DataFrame({
    "feature": feature_names,
    "coefficient": coefficients
})

print(f'The intercept is {intercept:.4f}.')
for feature, coefficient in zip(coef_df.feature, coef_df.coefficient):
    print(f'The {feature} feature has a coefficient of {coefficient:.4f}.')
The intercept is -0.0000.
The city feature has a coefficient of 0.7124.
The operator feature has a coefficient of 0.2278.
The beds feature has a coefficient of 0.0286.
The age feature has a coefficient of -0.1029.
The room_type_Non En-Suite feature has a coefficient of -0.0598.
The room_type_One Bed feature has a coefficient of 0.3185.
The room_type_Studio feature has a coefficient of 0.3565.
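Since the features were standardised, the coefficient magnitudes are directly comparable; ranking them by absolute size (reproducing the values printed above) makes the ordering explicit:

```python
import pandas as pd

# Coefficients reproduced from the Linear Regression output above
coef_df = pd.DataFrame({
    'feature': ['city', 'operator', 'beds', 'age',
                'room_type_Non En-Suite', 'room_type_One Bed', 'room_type_Studio'],
    'coefficient': [0.7124, 0.2278, 0.0286, -0.1029, -0.0598, 0.3185, 0.3565],
})

# Sort by absolute magnitude so positive and negative effects rank together
ranked = coef_df.reindex(coef_df['coefficient'].abs().sort_values(ascending = False).index)
print(ranked.to_string(index = False))
```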
feature_importance(optim_gb_model, transformed_X_train)
As we saw earlier, the most important feature in both models is clearly the city feature, which is twice as important as any other feature. This is as expected given that different cities have different demand drivers within the PBSA market, such as number of universities, reputability of universities, number of students, current number of PBSA beds, and the resulting supply-demand metrics that result from this. Moreover, even within a wider real estate context, different cities have varying degrees of competing land uses. For example, prime London real estate could be commercial or residential and developers need to see a certain amount of profit for it to be worth them building PBSA. Therefore, rental rates in London need to be higher to ensure that a developer gets an appropriate yield on cost for their development.
The next set of features have similar importance in both models, albeit in slightly different orders: room_type_Studio, room_type_One Bed, and operator. In both models, room_type_Studio is the most important of the three. This makes sense, as compared to the base category of an en-suite (given the OHE technique), a studio should garner significantly more rent due to the privacy and sole use of a kitchen it offers the occupant, as well as the increased size. In both models, room_type_One Bed was nearly as important, for similar reasons regarding the superior product being sold.
The Gradient Boosting model considered operator to be more important than room_type_One Bed, whereas the Linear Regression model had it as less significant, with a coefficient of $0.2278$ in comparison to $0.3185$. Either way, both models ascribe importance to it. Contextually, this also makes sense, given that more premium operators, with a higher standard of service, are able to attain higher rents. Moreover, more premium operators tend to operate more premium buildings with greater amenities, as this is all part of the premium offering. We have not accounted explicitly for amenity quality amongst assets, but it is logical that an asset with a greater amenity offering could charge higher rents than one with less, all else being equal. However, its lower importance relative to city or room_type_Studio makes sense, given that however premium an operator is, it will be somewhat curtailed by the market dynamics of the city the asset is in and by the nature of the product it is selling. Students are unlikely to pay significantly more for an en-suite from a better operator, as a studio is simply a far superior product, and the best operator in Carlisle is going to struggle to charge rents higher than a mid-level operator in central London, due to the nature of the sub-markets the assets are in.
The next most important feature in both models was the age of the asset. We would expect some correlation between age and weekly_rent, as it makes logical sense that newer buildings have better amenities and more curb appeal and can therefore charge higher rents. However, there are more factors at play here. As PBSA surged in popularity as an asset class, developers sought to follow this popularity, resulting in the PBSA building boom witnessed between 2014 and 2020, as seen in Section 3.3. This surge in PBSA assets resulted in multiple tall buildings being built in city centres, becoming hubs for students. Naturally, local residents may not always favour their area becoming a student hub, resulting in pressure on local governments to restrict the development of PBSA. As a result, councils have made the development of PBSA more difficult, a notable example being the London Plan, with certain London Boroughs, such as Southwark, making it even more difficult to build by requiring large affordable-housing contributions. This has caused somewhat of a slowdown in PBSA development, further exacerbated by the majority of the most obvious and best sites in proximity to universities having already been developed during the early PBSA building boom. This leaves less appealing sites, which may not be viable given stringent local government measures. As a result, whilst newer buildings would typically be expected to achieve higher rents, this effect is somewhat offset by older assets tending to be in more prime locations within their local sub-markets, and we have already seen how significant location is with regards to rents, as evidenced by the macro-location city feature.
The final two features are beds and room_type_Non En-Suite. The Linear Regression model considers beds half as significant as room_type_Non En-Suite, whereas the Gradient Boosting model considers them equally unimportant, with room_type_Non En-Suite marginally less so. Either way, both contribute little to the model. With regards to beds, this makes sense: a larger asset may have more amenity space, but that space is shared with more people, somewhat diluting the effect. Moreover, most students probably pay little attention to the size of the asset unless it is at the extreme end of the spectrum, meaning that for the majority of assets, beds does not have an impact. With regards to room_type_Non En-Suite, this probably stems from what our box plots in Section 3.3 highlighted - there doesn't appear to be that big a difference between the en-suite and non en-suite room types from a weekly_rent perspective. Whilst we would expect it to have a slightly negative effect (and it does), it is not as strong as one might think. The gap between a studio and an en-suite is more significant than the gap between an en-suite and a non en-suite, suggesting that students value not having to share a kitchen more than not having to share a bathroom. Alternatively, the decider could be the extra tangible space a studio offers, given that en-suites and non en-suites feel a similar size due to the en-suite bathroom being a separate room and therefore less visible and tangible.
In conclusion, we have found that the ensemble model can predict PBSA rental rates using the features provided with a strong degree of accuracy, with an $R^{2}$ score of approximately $85\%$. By far the most important feature for predicting PBSA rental rates is the city, which makes sense, as sub-market dynamics drastically influence rental rates across all real estate asset classes. Moreover, we learned that the age of a building is less significant than one would imagine, with various reasons for this posited. Finally, we found that the size of an asset contributes little to the rents that can be achieved, and that the difference between non en-suites and en-suites is not considered especially significant by our models either.
In this project we have examined using machine learning methods to model the rental rates achieved in the PBSA market.
We started by importing some data and cleaning it up, deciding on the features we deemed important enough to include in our model. We then performed EDA to ascertain the distribution of our data, which would inform transformations we would make later. We then remedied any data points considered outliers through a mixture of corrections and removals. The first model we wanted to try was Linear Regression due to the log-normality of the target variable and the simplicity of the model. We therefore showed that the data could indeed be transformed to be suitable for such a model. After satisfying this, we moved on to Section 5, where we performed the feature engineering. Here we got our data ready for modelling through a series of transformations and standardisations. Following this, we considered three different types of models - the aforementioned Linear Regression, as well as two non-linear models in Random Forest and Gradient Boosting. After comparing the models, we decided to use a Gradient Boosting model, which we iterated to improve. Following iteration, we considered a further development by averaging the predictions of the optimised Gradient Boosting model with the original Linear Regression model to create an ensemble model. This model was selected going forward. We then evaluated this model before interpreting the results in the context of the PBSA rental market.
We found that PBSA rental data could be modelled accurately using this model, with an MSE of $0.0211$ and an $R^{2}$ score of $0.8494$ on the test set, the model losing only ~$4.5$ percentage points from the training $R^{2}$ score to the test $R^{2}$ score, evidencing that it generalises to unseen data well. The residual analysis on the test data suggested our residuals were slightly negatively skewed, which was not an overall concern, given that our ensemble approach incorporated a non-linear model in the Gradient Boosting method, somewhat relaxing the normality requirement of a simple Linear Regression model.
This model showed that by far the most important feature was the city. Given that every city has a different market, this was the distinguishing factor. Other important factors included studio rooms, one-bed rooms, and the operator, whilst the remaining factors had less significant influences. This makes sense, as studios and one-beds are a notably superior product to en-suites, and the quality of operator and the service level provided vary widely across the PBSA landscape. We note that in order to achieve the highest rents, one would need to be in London, offering a premium service on an asset with studios and one-beds only. However, this is only part of the equation: operational and development costs in London are significantly higher, and a studio-only asset can provide fewer rooms than a cluster-led approach. Whilst higher rents may be achieved on a room-by-room basis, the bottom line of profitability has not been accounted for here, meaning that, as we see in reality, there is profit to be found across the spectrum of cities, operator quality, and asset mix in the PBSA space, and there is no blanket one-size-fits-all approach.
Going forward, there are several next steps. Access to larger amounts of data, such as room sizes, could prove influential and well worth exploring, were the data available and accurate. Moreover, on several occasions we found that the sub-market dynamics within a city were more influential than one would imagine. This is particularly prevalent in London, but is likely to be the case across the UK. To this end, a future study could use the removed postcode data to determine which assets are close to one another and to important amenities, in an attempt to explore the sub-market dynamics of the PBSA space. This would serve as an excellent complement to this study, which sought to model PBSA rents on a UK-wide level and achieved this with a respectable degree of success.